Abstract. We study the rates of estimation of finite mixing distributions,
that is, the parameters of the mixture. We prove that under some regularity
and strong identifiability conditions, around a given mixing distribution
with $m_0$ components, the optimal local minimax rate of estimation of a
mixing distribution with $m$ components is $n^{-1/(4(m-m_0)+2)}$. This
corrects a previous paper by Chen (1995) in The Annals of Statistics.

By contrast, it turns out that there are estimators with a (non-uniform)
pointwise rate of estimation of $n^{-1/2}$ for all mixing distributions
with a finite number of components.


1. Introduction. Finite mixture models go back to the work of Pearson
(1894) who studied biometrical ratios on crabs. As a flexible tool to grasp
heterogeneity in data, these models have emerged and successfully been
applied in various fields including astronomy, biology, genetics, economics,
social sciences and engineering. A general introduction as well as a brief
history can be found in the book of McLachlan and Peel (2000).

There are essentially three cases where finite mixtures and their
estimation naturally arise. One actively investigated topic is model-based
clustering. Here the aim is to divide the data into k clusters and assign
(new) data to a cluster. A possible approach is to consider that each data
point from a cluster is generated according to a probability density known
up to a few parameters, so that the whole data is generated by a mixture
with k components (McLachlan and Peel, 2000; Teh, 2010).

The second, more traditional case is the statistical description of possibly
heterogeneous data where the underlying mixing distribution has no
particular meaning. In that case, mixtures are a tool to describe efficiently
the “true” probability distribution. The goal is then to control the
convergence rate of mixture estimators to this “true” probability measure
(van de Geer, 1996; Ghosal and van der Vaart, 2001; Genovese and Wasserman,
2000).

In the third case, we are interested in the mixing distribution itself, that
is, the parameters of the mixture. The support points and proportions are
the parameters we want to estimate. Typically, they correspond to the
phenomenon that is studied, but we only observe data points distributed
under the probability distribution corresponding to the mixture. This is the
case we are interested in.

Notice that some works try to bridge the gap between the estimation of
the mixture and the estimation of the mixing distribution, usually at least
through estimation of the number of components (the order) of the finite
mixture. In particular, Rousseau and Mengersen (2011) have proved that
their Bayesian estimator of the mixture tends to empty the extra
components, and Gassiat and van Handel (2013) have given the minimal penalty
on the maximum likelihood estimator that yields strong consistency on the
order.

One could expect that a good estimator for the mixture would be a good
estimator for the mixing model. However, this is not so clear. The situation
is reminiscent of the difference between estimation and identification in
model selection, where Yang (2005) has proved that no procedure can be
optimal for both. Moreover, rates of convergence can be very different, as
illustrated in an infinite-dimensional case by Bontemps and Gadat (2014).

When the aim is to estimate the mixture parameters, optimal rates are
key information. These were unknown (see e.g. Titterington, Smith and
Makov, 1985) until the work of Chen (1995), who established an $n^{-1/4}$
local minimax rate, under reasonable identifiability conditions, for
mixtures with a one-dimensional parameter.

This result is somewhat surprising, since the rate does not depend on the
number of components. In particular, as a rule of thumb, if a continuous
parameter (here, the rate exponent) is constant for all big integers, it is
the same in the infinite case. However, mixtures with an infinite number of
components can only be estimated at a non-parametric rate in general. Indeed,
deconvolution may be viewed as a special case of an infinite mixture problem:
estimating the mixing distribution of the shifts of the probability measure
of the noise. Now, Fan (1991) had proved that the $L^2$-convergence rate was
(a power of) logarithmic in general, and Caillerie et al. (2013) and Dedecker
and Michel (2013) have generalized this kind of rate to different
Wasserstein metrics, including the $L^1$-Wasserstein metric. The latter is
the one used by Chen (1995).
A possible explanation could have been a constant in front of the rate that
would explode with the number of components. It turns out, however, that
the result by Chen (1995) is erroneous.

Let us be more specific. In his Theorem 1, Chen (1995) proves an $n^{-1/4}$
lower bound on the local minimax rate. Lemma 2 provides a control on a
power $\alpha$ of the transportation distance between two mixing
distributions by the $L^\infty$-distance between the corresponding
probability distribution functions. This control is uniform on all pairs of
mixing distributions in a ball around a mixing distribution $G_0$. This
uniform control entails (Theorem 2) an upper bound on the local minimax rate
of estimation.

The exponent $\alpha$ in Lemma 2 was equal to 2, so that the lower and upper
bounds coincide. However, Lemma 2 and its proof contain an error: they
overlook that distinct components can converge to the same one. Our article
aims at giving correct statements and proofs for this Lemma 2 and its
consequences.

The main part consists in finding the correct $\alpha$; that is Theorem 3.3.
Theorem 3.2 gives the matching lower bound, so that the local minimax rate
is established.

Interestingly, another way to correct Lemma 2 is by restricting the pairs
of mixtures that are compared. Namely, instead of comparing all pairs of
mixtures in a ball around $G_0$, we allow only comparison of a mixture in the
ball with the ball center $G_0$. Then $\alpha = 2$ is valid. We give the
corresponding statement in Theorem 4.7. Translated to Theorem 2 of Chen,
this corresponds to dropping uniformity. That is, for any fixed $G$, the same
estimator will converge at rate $n^{-1/4}$, but the constant depends on $G$:
this is a bound on the pointwise rate everywhere, instead of a bound on the
local minimax rate.

Thus the optimal local minimax rate and the optimal pointwise rate of
estimation everywhere do not coincide. This discrepancy is not very usual in
statistics, and often a source of confusion. To make things a little clearer,
we also establish the optimal pointwise rate in Theorem 3.5. Since Theorem 1
of Chen (1995) is a bound on the local minimax rate, the pointwise rate might
be better. And indeed, the optimal pointwise rate everywhere is $n^{-1/2}$.

The paper by Chen (1995) has been widely cited and used. Apart from
applied papers citing it that may have relied on the theoretical guarantees
(see e.g. Kuhn et al., 2014; Liu and Hancock, 2014), there are essentially
two ways it could play a role: first, when it is used as part of a proof;
second, when it is used as a benchmark.

The first case covers papers that generalize Chen’s result in other settings,
and re-use its theorems and proofs. For example, Ishwaran, James and Sun
(2001) propose a Bayesian estimator that achieves the frequentist rate
$n^{-1/4}$, and use Chen (1995, Lemma 2) in their analysis. More recently,
Nguyen (2013) generalizes those results to mixtures with an abstract
parameter space and an indefinite number of components. However, his
Theorem 1 generalizes Chen (1995, Lemma 2) while transposing the proof with
the mistake. The main results of both these articles nevertheless hold: they
do not need the full strength of Chen (1995, Lemma 2), but merely the weaker
version, Theorem 4.7.

These two papers also use Chen’s (1995) article as a benchmark. However,
the optimal pointwise rate everywhere would probably be a better reference
point in their case, as in many others. In particular, it seems likely that a
Bayesian estimator could converge pointwise at speed $n^{-1/2}$ everywhere.
We have not checked whether the proof by Ishwaran, James and Sun (2001) can
be improved, or if another prior is necessary.

This use as a benchmark is very usual, as expected for this kind of
optimality result (see e.g. Zhu and Zhang, 2006, 2004). Let us point in
particular to a result by Martin (2012). He achieves almost rate $n^{-1/2}$
for the predictive recursion algorithm, and tries to explain the discrepancy
with Chen (1995) by the fact that the parameters are constrained to live in a
finite space for his algorithm. In fact, since his rate is pointwise, it fits
with the continuous case.


In Section 2, we give the notations and define and discuss the regularity
assumptions we use. In Section 3, we state and discuss the main theorems,
giving the optimal local minimax rate and pointwise rate everywhere. We
try to give some intuition. We also dwell on the interpretation and practical
consequences of having different rates, and conclude the section with open
questions. In Section 4, we give and explain the meaning of the key
intermediate results and prove the main theorems from there. In Section 5, we
prove those key intermediate results. In particular, in Section 5.1, we
introduce the most original tool of our proofs: the coarse-graining tree,
which allows us to patch the mistake in the article by Chen (1995).

Some auxiliary and technical results are detailed in appendices grouped 
in a supplemental part (Heinrich and Kahn, 2015). 

2. Notations and regularity conditions. 

2.1. General notations. Throughout the paper, the family
$\{f(x,\theta)\}_{\theta\in\Theta}$ will consist of probability densities
$x \mapsto f(x,\theta)$ on $\mathbb{R}$ with respect to some $\sigma$-finite
measure $\lambda$. The parameter set $\Theta$ is always assumed to be a
compact subset of $\mathbb{R}$ with non-empty interior. We write
$\mathrm{Diam}\,\Theta$ for its diameter. Given an $m$-mixing (or $m$-points
support) distribution $G$ on $\Theta$, a finite mixture model with $m$
components is defined by

(1) $f(x,G) = \int_\Theta f(x,\theta)\,dG(\theta).$

The set of such $m$-mixing distributions $G$ is denoted by $\mathcal{G}_m$,
and $\mathcal{G}_{\le m}$ will be the union of $\mathcal{G}_j$ for
$j \in [\![1,m]\!]$. Similarly, the set of finite mixing distributions is
denoted by $\mathcal{G}_{<\infty}$. For two mixing distributions $G_1$ and
$G_2$, note that by linearity $f(x,G_1-G_2) = f(x,G_1) - f(x,G_2)$. This will
be used to shorten expressions.

In what follows $\|\cdot\|_\infty$ is the supremum norm with respect to $x$
and $\|\cdot\|$ is any norm in finite dimension. Throughout the paper, the
variable $x$ plays no role, and we often suppress it in the notation. The
$p$-th derivative $f^{(p)}(x,\theta)$ is always taken with respect to the
variable $\theta$.

We write $F_n$ for the empirical distribution, that is, if $X_1,\dots,X_n$
are independent with distribution $F(\cdot,G)$, then
$F_n(t) = \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}_{X_i \le t}$.

As usual, the ($L^1$)-transportation distance, or Wasserstein metric, is used
to compare two mixing distributions $G_1$ and $G_2$. It completely bypasses
identifiability issues that would arise with the square error on parameters.
The definition is:

(2) $W(G_1,G_2) = \inf_{\Pi} \int_{\Theta\times\Theta} |\theta-\theta'|\,d\Pi(\theta,\theta'),$

where the infimum is taken over probability measures $\Pi$ on
$\Theta\times\Theta$ with marginals $G_1$ and $G_2$. By the
Kantorovich-Rubinstein dual representation (e.g. Dudley, 2002, section 11.8),
$W(G_1,G_2)$ can be viewed as a supremum:

(3) $W(G_1,G_2) = \sup_{|f|_{\mathrm{Lip}} \le 1} \int_\Theta f(\theta)\,d(G_1-G_2)(\theta),$

where $|f|_{\mathrm{Lip}}$ stands for the Lipschitz seminorm of $f$. Endowed
with the metric $W$, the space $\mathcal{G}_{\le m}$ is compact. It is
sometimes convenient to use the notation $W(G_1-G_2)$ instead of
$W(G_1,G_2)$.
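For finitely supported mixing distributions on the real line, the
transportation distance can be computed directly. The following is a minimal
sketch, not part of the original paper: it relies on the standard
one-dimensional identity that the $L^1$-transportation distance equals the
integral of the absolute difference of the two cumulative distribution
functions. The function name `wasserstein1` and the dictionary representation
`{support point: weight}` are illustrative choices.

```python
from fractions import Fraction


def wasserstein1(atoms_a, atoms_b):
    """L^1 transportation distance between two finitely supported
    distributions on the real line, each given as {support point: weight}.

    Uses the one-dimensional identity W = integral of |CDF_a - CDF_b|.
    """
    points = sorted(set(atoms_a) | set(atoms_b))
    total = 0
    cdf_gap = 0  # value of CDF_a - CDF_b on the current interval
    for left, right in zip(points, points[1:]):
        cdf_gap += atoms_a.get(left, 0) - atoms_b.get(left, 0)
        total += abs(cdf_gap) * (right - left)
    return total


# Two half-weight atoms at -1 and 1, compared with a single atom at 0:
# each half of the mass has to travel a distance 1.
example = wasserstein1(
    {Fraction(-1): Fraction(1, 2), Fraction(1): Fraction(1, 2)},
    {Fraction(0): Fraction(1)},
)
```

Exact rational weights are used so that the result carries no rounding error;
floating-point weights work with the same function.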

We also introduce the Wasserstein $\varepsilon$-ball of a mixing distribution
$G_0$:

$\mathcal{W}_{G_0}(\varepsilon) = \{G \in \mathcal{G}_{<\infty} : W(G,G_0) < \varepsilon\}.$

In the rest of the paper, we will need to compare sequences, say $(a_n)$ and
$(b_n)$. The notation $a_n \preccurlyeq b_n$ (or even $a \preccurlyeq b$ if
$n$ is kept implicit) means that there is a positive constant $C$ such that
$a_n \le C b_n$; in other words, $a_n = O(b_n)$. We will also use
$a_n \asymp b_n$ for $b_n \preccurlyeq a_n \preccurlyeq b_n$. If we need to
stress the dependence of the constants $C$ on other parameters, say
$C = C(u,v,\theta)$, we will write $a_n \preccurlyeq_{u,v,\theta} b_n$ or
$a_n \asymp_{u,v,\theta} b_n$.

Below $\xrightarrow{d}$ (resp. $\xrightarrow{P}$) stands for convergence in
distribution (resp. in probability). We write $[\![i,j]\!]$ for the set of
integers between $i$ and $j$.

2.2. Regularity: (p,q)-smoothness. It is notationally natural to set

(4) $F(x,\theta) = \int_{-\infty}^{x} f(y,\theta)\,d\lambda(y),$

and to denote by $E_\theta$ the expectation w.r.t. $f(x,\theta)\,d\lambda(x)$.
If we identify $\theta$ with the Dirac measure $\delta_\theta$, the notations
extend naturally to mixing distributions $G$ by linearity.

Recall that derivatives are taken w.r.t. the variable $\theta$.

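Definition 2.1. Set, for $p \in \mathbb{N}$ and $q > 0$,

(5) $E_{p,q}(\theta,\theta',\theta'') = E_\theta \left| \frac{f^{(p)}(x,\theta')}{f(x,\theta'')} \right|^q.$

We say that $\{f(\cdot,\theta), \theta \in \Theta\}$ is $(p,q)$-smooth if

1. $E_{p,q}$ is a well-defined $[0,\infty]$-valued continuous function on $\Theta^3$,

2. There exists $\varepsilon > 0$ such that

$|\theta' - \theta''| < \varepsilon \implies \forall \theta \in \Theta,\ E_{p,q}(\theta,\theta',\theta'') < \infty.$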

These smoothness conditions are easy to check in practice, and general
enough. For example, all exponential families satisfy them, as shown in
Section D.2 in the supplemental part (Heinrich and Kahn, 2015).

They will be useful for proving local asymptotic normality (Le Cam, 1986)
of relevant families.




2.3. Regularity: k-strong identifiability. Chen (1995) introduced a notion
of strong identifiability. We will need a slightly more general version.

Definition 2.2. The family $\{F(\cdot,\theta), \theta \in \Theta\}$ of
distribution functions is $k$-strongly identifiable if for any finite set of,
say, $m$ distinct $\theta_j \in \Theta$, the equality

$\Big\| \sum_{p=0}^{k} \sum_{j=1}^{m} \alpha_{p,j} F^{(p)}(\cdot,\theta_j) \Big\|_\infty = 0$

implies $\alpha_{p,j} = 0$ for all $p$ and $j$.

Chen’s strong identifiability corresponds to 2-strong identifiability. Let us
exemplify why this notion is useful. Consider a sequence of mixing
distributions $G_n = \frac{1}{2}(\delta_{-n^{-1}} + \delta_{n^{-1}})$. Then,
if we can expand around $\theta = 0$, we see that
$F(\cdot,G_n) = F(\cdot,0) + \frac{1}{2} n^{-2} F''(\cdot,0) + o(n^{-2})$.
Then 2-strong identifiability ensures that
$\|F(\cdot,0) - F(\cdot,G_n)\|_\infty$ is of order $n^{-2}$, as shown in
Proposition 2.3 below, whereas simple identifiability would say nothing. We
will need $k$-strong identifiability when more moments in $\theta$ w.r.t. the
two mixing distributions are the same.
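The order $n^{-2}$ in this example can be checked numerically. The sketch
below is an illustration under assumptions that are not in the text: it takes
the unit-variance Gaussian location family $F(x,\theta) = \Phi(x-\theta)$,
reads $G_n$ as two atoms of weight 1/2 at $\pm 1/n$, and approximates the
supremum over $x$ on a finite grid. The names `normal_cdf` and `sup_gap` are
hypothetical helpers.

```python
import math


def normal_cdf(x):
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def sup_gap(n, grid_step=0.01, half_width=6.0):
    """Grid approximation of sup_x |F(x, 0) - F(x, G_n)| for
    F(x, theta) = Phi(x - theta) and G_n = (delta_{-1/n} + delta_{1/n}) / 2."""
    h = 1.0 / n
    steps = int(round(2 * half_width / grid_step))
    worst = 0.0
    for i in range(steps + 1):
        x = -half_width + i * grid_step
        mixture = 0.5 * (normal_cdf(x + h) + normal_cdf(x - h))
        worst = max(worst, abs(mixture - normal_cdf(x)))
    return worst


# If the gap is of order n^{-2}, then n^2 * sup_gap(n) should stabilise.
scaled_gaps = [n * n * sup_gap(n) for n in (10, 20, 40)]
```

For this family the second derivative in $\theta$ is $-(x-\theta)\varphi(x-\theta)$,
whose supremum in absolute value is $\varphi(1) \approx 0.242$, so the scaled
gaps should settle near half of that value.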


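Proposition 2.3. Fix $m \ge 1$. Let $\{F(\cdot,\theta), \theta \in \Theta\}$
be $k$-strongly identifiable. For $\varepsilon > 0$, the $\theta_i$ are
$\varepsilon$-separated if they belong to

$\mathcal{D}_\varepsilon = \{(\theta_i)_{1 \le i \le m} \in \Theta^m : |\theta_i - \theta_j| \ge \varepsilon \text{ for all } i \ne j\}.$

If $F^{(k)}(x,\theta)$ is continuous in $\theta$, then, for
$(\theta_j) \in \mathcal{D}_\varepsilon$,

(6) $\Big\| \sum_{p=0}^{k} \sum_{j=1}^{m} \alpha_{p,j} F^{(p)}(\cdot,\theta_j) \Big\|_\infty \succcurlyeq_{\varepsilon} \|\alpha\|.$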


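Proof. Set $\alpha = (\alpha_{p,j})$ and $\theta = (\theta_j)$. The
$[0,\infty]$-valued function
$(\alpha,\theta) \mapsto \big\| \sum_{p=0}^{k} \sum_{j=1}^{m} \alpha_{p,j} F^{(p)}(\cdot,\theta_j) \big\|_\infty$
is lower semi-continuous on the compact set
$\{\alpha : \|\alpha\| = 1\} \times \mathcal{D}_\varepsilon$, so that it
admits a minimum. By $k$-strong identifiability, it is nonzero. □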


We expect the strong identifiability to be rather generic, and hence the
statements of this paper often meaningful. In particular, Chen (1995,
Theorem 3) has proved that location and scale families with smooth densities
are 2-strongly identifiable. The theorem and the proof straightforwardly
generalise to our case. We merely state the result.

Theorem 2.4. Let $k \ge 1$. Let $f$ be a probability density with respect to
the Lebesgue measure on $\mathbb{R}$. Assume that $f$ is $k-1$ times
differentiable with

$\lim_{x \to \pm\infty} f^{(p)}(x) = 0$ for $p \in [\![0,k-1]\!]$.

Consider $f(x,\theta) = f(x-\theta)$, with $\theta \in \Theta \subset \mathbb{R}$.
Then the corresponding family of distributions
$\{F(\cdot,\theta), \theta \in \Theta\}$ is $k$-strongly identifiable. If
$\Theta \subset (0,\infty)$, the result stays true with
$f(x,\theta) = \frac{1}{\theta} f\big(\frac{x}{\theta}\big)$.

See also the article by Holzmann, Munk and Stratmann (2004) for more
general conditions, that also generalize well to $k$-strong identifiability.

2.4. Assumptions. For proving lower bounds on rates, we will assume:

Assumption A. The family of densities $\{f(\cdot,\theta), \theta \in \Theta\}$
satisfies, with $G_0 \in \mathcal{G}_{m_0}$,

• $(p,q)$-smoothness for all $(p,q) \in [\![1, 2(m-m_0)+2]\!] \times [\![1,4]\!]$,

• There is some support point $\theta_0$ of $G_0$ such that $\theta_0$ is in the interior of $\Theta$ and

J > 0. 

These conditions allow us to prove local asymptotic normality (Le Cam,
1986) for relevant families. This will give some insight on the reason why
the lower bound on the rate holds, and on how the mixtures behave when
we change the parameters in the least sensitive direction. The condition on
the support point guarantees identifiability locally for the families, and we
need more derivatives than usual, since there will be cancellations in the
first terms.

For proving upper bounds on rates we will assume: 

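Assumption B(k). The family of densities $\{f(\cdot,\theta), \theta \in \Theta\}$
satisfies, with $F(x,\theta) = \int_{-\infty}^{x} f(\cdot,\theta)\,d\lambda$,

• For all $x$, $F(x,\theta)$ is $k$-differentiable w.r.t. $\theta$,

• $\{F(\cdot,\theta), \theta \in \Theta\}$ is $k$-strongly identifiable,

• There is a uniform continuity modulus $\omega(\cdot)$ such that

$\sup_x |F^{(k)}(x,\theta_2) - F^{(k)}(x,\theta_1)| \le \omega(\theta_2 - \theta_1)$

with $\lim_{h \to 0} \omega(h) = 0$.

Notice that the latter condition is satisfied if $F^{(k+1)}(x,\theta)$ exists
and is bounded.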

These differentiability conditions should be compared with the usual
parametric case, where differentiability in quadratic mean, or twice
differentiability in $\theta$ for a less technical condition, is enough to get
the local minimax rate.

We will need B(2m) to prove a global minimax rate of $n^{-1/(4m-2)}$ and
B(1) for a pointwise rate of $n^{-1/2}$ everywhere.

3. Main results. We now have the tools to state the main results. 

Keeping in mind the following viewpoint will help getting intuition on the
results. The data we have access to is the empirical distribution $F_n$, which
gets closer to the true mixture $F(\cdot,G)$ at rate $n^{-1/2}$. Hence $G_1$
and $G_2$ can be told apart if $\|F(\cdot,G_1) - F(\cdot,G_2)\|_\infty$ is at
least of order $n^{-1/2}$.
If we get a control on a power $W(G_1,G_2)^{\alpha}$ of the transportation
distance by $\|F(\cdot,G_1) - F(\cdot,G_2)\|_\infty$, we then get
$n^{-1/(2\alpha)}$ rates. For upper bounds, Lemma 4.5 makes this rigorous.

For lower bounds, general estimators could hope to do better, say by
noticing that some data points are not in the support of some $G_i$. However,
this will not be the case under sufficient smoothness conditions.

In this setting, the minimum distance estimator discussed by Deely and 
Kruse (1968) and Chen is natural, and we often use it later on. 

Definition 3.1. The minimum distance estimator
$\hat G_n \in \mathcal{G}_{\le m}$ is any mixing distribution whose
corresponding mixture minimizes the $L^\infty$-distance to the empirical
distribution, that is:

(7) $\|F(\cdot,\hat G_n) - F_n\|_\infty = \inf_{G \in \mathcal{G}_{\le m}} \|F(\cdot,G) - F_n\|_\infty.$

Note that the infimum is attained since
$G \mapsto \|F(\cdot,G) - F_n\|_\infty$ is lower semi-continuous on the
compact metric space $(\mathcal{G}_{\le m}, W)$.
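To make the definition concrete, here is a rough sketch of a minimum distance
fit. It is only an illustration under assumptions that are not in the paper:
unit-variance Gaussian location components, at most two components, and a
brute-force search over a user-supplied grid instead of an exact minimiser.
The helper names `normal_cdf`, `sup_distance` and `min_distance_fit` are
hypothetical.

```python
import math
import random


def normal_cdf(x):
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def sup_distance(sample, weights, locations):
    """sup_t |F(t, G) - F_n(t)| for G = sum_j weights[j] * delta_{locations[j]}
    with unit-variance Gaussian components (assumes no ties in the sample)."""
    xs = sorted(sample)
    n = len(xs)
    worst = 0.0
    for i, x in enumerate(xs, start=1):
        f = sum(w * normal_cdf(x - t) for w, t in zip(weights, locations))
        # F_n jumps from (i-1)/n to i/n at x, so check both sides of the jump.
        worst = max(worst, i / n - f, f - (i - 1) / n)
    return worst


def min_distance_fit(sample, grid, weight_grid):
    """Grid search over mixtures with at most two components.
    Returns (distance, ((w1, t1), (w2, t2))) for the best grid point."""
    best_distance, best_mixture = float("inf"), None
    for t1 in grid:
        for t2 in grid:
            if t2 < t1:
                continue  # each unordered pair of locations is visited once
            for w in weight_grid:
                d = sup_distance(sample, (w, 1 - w), (t1, t2))
                if d < best_distance:
                    best_distance, best_mixture = d, ((w, t1), (1 - w, t2))
    return best_distance, best_mixture


# Toy usage: a one-component sample, fitted within at most two components.
rng = random.Random(0)
sample = [rng.gauss(0.0, 1.0) for _ in range(200)]
grid = [i / 2 for i in range(-4, 5)]
distance, mixture = min_distance_fit(sample, grid, (0.5, 0.75, 1.0))
```

A grid search only approximates the infimum in (7); any optimiser over the
compact parameter set could be substituted for it.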


3.1. Local asymptotic minimax rate. When the number $m$ of components
in a mixture is exactly known and $f(\cdot,\theta)$ is smooth in $\theta$, we
are in a simple smooth parametric case, with $2m-1$ parameters. Hence the
optimal local minimax rate of estimation is $n^{-1/2}$ in mean square error,
with a constant given by the Cramér-Rao bound (Hájek, 1972). This translates
to the same rate in transportation distance.

In particular the minimum distance estimator introduced above attains
the $n^{-1/2}$ rate (Theorems 4.8 and 3.5), not necessarily with the optimal
constant.

The difficulty with mixtures stems from what happens when the number 
of components is not known: is there only one component here, or two very 
close ones? If there are two, what are their weights and how far apart are 
they? 

We can build families of mixtures that are very hard to tell apart, because
their mixing distributions have the same first moments. Indeed, suppose that
all the support points of the mixture are of the form $\theta_0 + h_j$ with
$h_j$ small. Then a Taylor expansion in $\theta$ of the mixture $F(\cdot,G)$
yields:

$F(\cdot,G) = \sum_{p=0}^{k+1} \Big( \sum_j \pi_j h_j^p \Big) \frac{F^{(p)}(\cdot,\theta_0)}{p!} + o\Big( \sum_j \pi_j |h_j|^{k+1} \Big).$

So that, according to our heuristics on the empirical distribution, if $G_1$
and $G_2$ have the same $k$ first moments, we cannot tell them apart if
$|h_j|^{k+1} \ll n^{-1/2}$, that is if $|h_j| \ll n^{-1/(2k+2)}$.

As an example, let us consider two-component mixtures around
$G_0 = \delta_0$. Then
$G_{1,n} = \frac{1}{2}(\delta_{-2n^{-1/6}} + \delta_{2n^{-1/6}})$ and
$G_{2,n} = \frac{4}{5}\delta_{-n^{-1/6}} + \frac{1}{5}\delta_{4n^{-1/6}}$
both have 0 as first moment, and $4n^{-1/3}$ as second moment. The third
moments are respectively zero for $G_{1,n}$ and $12n^{-1/2}$ for $G_{2,n}$.
According to this heuristics, no test can reliably tell $G_{1,n}$ from
$G_{2,n}$ with an $n$-sample. On the other hand, we clearly have
$W(G_{1,n},G_{2,n}) = \frac{9}{5} n^{-1/6}$ for all $n$. So that the minimax
rate for 2-mixtures cannot be better than $n^{-1/6}$.
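The moments and the transportation distance in this example can be verified
with exact arithmetic. The sketch below works in units of $h = n^{-1/6}$ and
reads the two mixtures as weights 1/2, 1/2 at $\mp 2h$ and weights 4/5, 1/5
at $-h$ and $4h$; that reading, and the helper names `moment` and `w1`, are
assumptions of this illustration.

```python
from fractions import Fraction


def moment(atoms, p):
    """p-th moment of a finitely supported distribution {point: weight}."""
    return sum(w * x**p for x, w in atoms.items())


def w1(atoms_a, atoms_b):
    """L^1 transportation distance on the real line, computed as the
    integral of |CDF_a - CDF_b|."""
    points = sorted(set(atoms_a) | set(atoms_b))
    total, cdf_gap = 0, 0
    for left, right in zip(points, points[1:]):
        cdf_gap += atoms_a.get(left, 0) - atoms_b.get(left, 0)
        total += abs(cdf_gap) * (right - left)
    return total


# Support points in units of h = n^{-1/6}.
G1 = {Fraction(-2): Fraction(1, 2), Fraction(2): Fraction(1, 2)}
G2 = {Fraction(-1): Fraction(4, 5), Fraction(4): Fraction(1, 5)}

moments_G1 = [moment(G1, p) for p in (1, 2, 3)]  # multiples of h, h^2, h^3
moments_G2 = [moment(G2, p) for p in (1, 2, 3)]
distance = w1(G1, G2)                            # multiple of h
```

The first two moments agree, the third moments differ by $12h^3 = 12n^{-1/2}$,
which is the order at which the empirical distribution fluctuates, while the
mixing distributions stay at distance $\frac{9}{5}h$.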

This moment matching argument can be made rigorous and precise with
two tools. One is Lindsay’s (1989) Hankel trick (Theorem 4.2), also used by
Dacunha-Castelle and Gassiat (1997) to estimate the order of a mixture. The
other is local asymptotic normality (Definition 4.1), developed by Le Cam
(1986). We use them to build a one-parameter locally asymptotically normal
family with scale factor $n^{-1/(4(m-m_0)+2)}$ in Theorem 4.3, which will
entail:


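Theorem 3.2. Let $G_0 \in \mathcal{G}_{m_0}$ and set
$\varepsilon_n = n^{-1/(4(m-m_0)+2)+\kappa}$ with $\kappa > 0$. Under
Assumption A, for any sequence of estimators $\hat G_n$ based on i.i.d.
$n$-samples,

$\liminf_{n \to \infty} \sup_{G_1 \in \mathcal{G}_m \cap \mathcal{W}_{G_0}(\varepsilon_n)} n^{1/(4(m-m_0)+2)}\, E_{G_1}\big[ W(G_1,\hat G_n) \big] > 0.$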


Theorem 3.2 gives a lower bound on the local asymptotic minimax rate 
of estimation. The corresponding upper bounds, both local and global, are 
given by the following theorem: 


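Theorem 3.3. Let $G_0 \in \mathcal{G}_{m_0}$. Then, under Assumption B(2m),
there is an $\varepsilon > 0$ such that the minimum distance estimator (7) in
$\mathcal{G}_{\le m}$ satisfies

(8) $E_{G_1}\big[ W(\hat G_n,G_1) \big] \preccurlyeq \frac{1}{n^{1/(4(m-m_0)+2)}}$

uniformly for $G_1$ in $\mathcal{G}_{\le m} \cap \mathcal{W}_{G_0}(\varepsilon)$,
where $n$ is the sample size.

Moreover, uniformly for $G_1$ in $\mathcal{G}_{\le m}$,

(9) $E_{G_1}\big[ W(\hat G_n,G_1) \big] \preccurlyeq \frac{1}{n^{1/(4m-2)}}.$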


We prove it by establishing a uniform control of $W(G_1,G_2)^{2m-2m_0+1}$ by
$\|F(\cdot,G_1) - F(\cdot,G_2)\|_\infty$ in Theorem 4.6.
Obtaining this control is quite technical, however. To do so, we consider
sequences of couples $(G_{1,n}, G_{2,n})$ minimizing the relevant ratios, and
express $F(\cdot,G_{1,n}) - F(\cdot,G_{2,n})$ as a sum on their components
$F(\cdot,\theta_{j,n})$ and relevant derivatives. A difficulty arises:
distinct components $\theta_{j,n}$ may converge to the same $\theta_j$,
leading to cancellations in the sums. Forgetting this case was the mistake by
Chen (1995) in the proof of his Lemma 2. We overcome the issue in Section 5.1
by using a coarse-graining tree: each node corresponds to a set of components
whose pairwise distances decrease at a given rate. We may then use Taylor
expansions on each node and its descendants, while ensuring that we keep
non-zero terms (Lemma 5.2).

Remarks 3.4. • Theorems 3.2 and 3.3 together imply that the optimal local
asymptotic minimax rate is $n^{-1/(4(m-m_0)+2)}$ for estimating a mixture
with at most $m$ components around a mixture with $m_0$ components.

• The rate is driven by $m - m_0$, that is, it gets harder to estimate the
parameters of a mixture when it is close to a mixture with fewer components.

• The worst case is when $m_0 = 1$, yielding a global minimax rate of
estimation $n^{-1/(4m-2)}$. The rate gets worse when more components are
allowed. So that the nonparametric rates for estimating mixtures with
an infinite number of components, like in deconvolution, appear natural.

• On the other hand, when the number of components is known, that is
$m = m_0$, we have the usual local minimax rate $n^{-1/2}$.

• The global minimax rate on the mixtures with exactly $m$ components
$\mathcal{G}_m$ stays at $n^{-1/(4m-2)}$, because $\mathcal{G}_m$ is not
compact, and Theorem 3.2 still applies in the vicinity of $m_0$-component
mixtures.

The slower rate $n^{-1/(4(m-m_0)+2)}$ is a little surprising when, for
example, some Bayesian estimators have rate of convergence $n^{-1/4}$
(Ishwaran, James and Sun, 2001). However, this convergence rate is not the
local minimax rate, but is closer to a pointwise rate of convergence, that is
the speed at which an estimator converges to a fixed $G$ when $n$ increases.
The difference with local minimax may be viewed as the loss of uniformity in
$G$. We now study the optimal pointwise rates everywhere.

3.2. Pointwise rate and superefficiency. One motivation for local minimax
results was to make clear how the Hodges’ estimator (van der Vaart,
1998, ch. 8) and other superefficient estimators could cohabit with the
Cramér-Rao bound, and how much they could improve on it.

Specifically, a superefficient estimator can have a better pointwise
convergence rate than any regular estimator, but not a better local minimax
convergence rate (Hájek, 1972). Moreover, it turns out that they can only
have a better pointwise rate on a Lebesgue-null set (van der Vaart, 1998,
ch. 8).

Now, the set of parameters of mixtures with fewer than $m$ components
is a Lebesgue-null set among those of mixing distributions with at most $m$
components $\mathcal{G}_{\le m}$. Hence, we might expect that, by biasing the
estimators toward the low numbers of components, we might attain better
pointwise rates on $\mathcal{G}_{\le m}$, up to $n^{-1/2}$, which is the value
when the number of components is known. By letting $m$ go to infinity, we
would have this pointwise rate for all finite mixing distributions. It turns
out this is indeed the case.

An estimator achieving rate $n^{-1/2}$ may be built from minimum distance
estimators (7). For all $m$ we denote by $\hat G_{n,m}$ the minimum distance
estimator in $\mathcal{G}_{\le m}$. For any fixed $\kappa \in (0,1/2)$, set

(10) $\hat G_n = \hat G_{n,\hat m}$

with

(11) $\hat m = \hat m(n) = \inf\{ m \ge 1 : \|F(\cdot,\hat G_{n,m}) - F_n\|_\infty \le n^{-1/2+\kappa} \}.$

Since the typical distance between empirical and cumulative distribution
functions is $n^{-1/2}$, this $\hat m$ is the lowest number of components
that is not clearly insufficient.
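The selection rule for the number of components can be sketched in a few
lines. This is an illustration only: it assumes that the threshold in (11) is
$n^{-1/2+\kappa}$ with $\kappa \in (0,1/2)$, and that the sup-distances of the
minimum distance fits with at most 1, 2, ... components have already been
computed. The function name `select_order` is hypothetical.

```python
def select_order(distances, n, kappa):
    """Smallest m whose minimum-distance fit is within n**(-1/2 + kappa) of
    the empirical distribution function.

    distances[m - 1] is the sup-distance achieved with at most m components.
    Returns None if no fit in the list passes the threshold.
    """
    if not 0 < kappa < 0.5:
        raise ValueError("kappa must lie strictly between 0 and 1/2")
    threshold = n ** (-0.5 + kappa)
    for m, d in enumerate(distances, start=1):
        if d <= threshold:
            return m
    return None


# With n = 10000 and kappa = 0.25 the threshold is about 0.1: the
# one-component fit (0.3) is rejected, the two-component fit is accepted.
chosen = select_order([0.3, 0.02, 0.01], n=10_000, kappa=0.25)
```

The threshold sits between the noise level $n^{-1/2}$ and a constant, so that
a model with too few components is eventually rejected while a correct one is
eventually accepted.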

We will obtain: 


Theorem 3.5. Under Assumption B(1), for any finite mixing distribution
$G_0 \in \mathcal{G}_{<\infty}$,

$E_{G_0}\big[ W(\hat G_n,G_0) \big] \preccurlyeq n^{-1/2}.$

Notice that the above inequality is not uniform in $G_0$.


Remarks 3.6. • The rate $n^{-1/2}$ cannot be improved since it is the
rate if the number of components is known beforehand.

• This is slightly stronger than just checking that we find the right number
of components and then applying Theorem 3.3, because we need much
less regularity. Only Assumption B(1) is required, instead of B(2m).
That is, we do not need more smoothness when the number of components
increases. Under the hood we rely on the bound in Theorem 4.8
instead of Theorem 4.6.

• The estimation of the number of components $m$ and the estimation
of $G$ given that number are not tied to each other. For example, we may
estimate $m$ with Equation (11), and then use the maximum likelihood
estimator of $G$ on the corresponding set of mixtures. Conversely, we may
estimate the number of components using Gassiat and van Handel’s (2013)
penalized maximum likelihood estimator.

3.3. Interpretation and practical consequences. Disagreement between local
minimax and pointwise rates everywhere might be rare enough that it is
worth recalling what it means.

At a given point $G$, the asymptotic rate of convergence to $G$ will be the
pointwise rate $C(G)n^{-1/2}$. However, the estimator will enter this
asymptotic regime only after a long time. More precisely, it enters this
regime once $G$ is no longer in any of the balls used in the local minimax
bound. Alternatively, we may view this situation as the constant $C(G)$
exploding when $G$ is close to certain $G_0$.

In our case, imagine we have a mixture with three components, all within
distance $\delta$ of $\theta_0$. Then about $n = \delta^{-10}$ data points are
necessary to get an estimator with an error of $\delta$.

In particular, if $G_1$ and $G_2$ are two such three-component mixtures,
chosen to have the same first four moments, and $G_1'$ and $G_2'$ are the same
mixtures, rescaled to be ten times closer, we will need $10^{10}$ times as
many data points to tell them apart as for $G_1$ and $G_2$.
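The sample sizes quoted here follow from inverting the local minimax rate
$n^{-1/(4(m-m_0)+2)}$. The short sketch below ignores all constants and is
only meant to make the orders of magnitude explicit; the function names
`rate_denominator` and `samples_needed` are illustrative.

```python
from fractions import Fraction


def rate_denominator(m, m0):
    """Returns 4(m - m0) + 2, so that the local minimax rate for estimating an
    m-component mixture near an m0-component one is n**(-1/denominator)."""
    if not 1 <= m0 <= m:
        raise ValueError("need 1 <= m0 <= m")
    return 4 * (m - m0) + 2


def samples_needed(delta, m, m0):
    """Order of magnitude of n solving n**(-1/(4(m-m0)+2)) = delta,
    constants ignored."""
    return delta ** (-rate_denominator(m, m0))


# Three components clustered around a single point: m = 3, m0 = 1.
exponent = rate_denominator(3, 1)
# Shrinking the cluster by a factor 10 multiplies the required sample size.
blow_up = samples_needed(Fraction(1, 100), 3, 1) / samples_needed(Fraction(1, 10), 3, 1)
```

With $m = 3$ and $m_0 = 1$ the exponent is 10, which is where both
$\delta^{-10}$ and the factor $10^{10}$ come from.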

As a consequence, if the components of the mixture to be estimated are 
not far apart one from the other, it is quite often impossible to get enough 
data points to get an appropriate estimate. 

An experimentalist with any leeway in what he measures (use of different 
markers, say) might then wish to ensure that the peaks are far apart, even 
at the cost of many data points. 

3.4. Further work. This article contains the proof that the optimal local
minimax rate of estimation around a mixture with $m_0$ components among
mixtures with $m$ components is $n^{-1/(4(m-m_0)+2)}$ when the parameter
space $\Theta$ is a compact subset of $\mathbb{R}$.

We think that extension to a multivariate $\Theta$ should be easy enough,
much like Nguyen (2013) did for the former erroneous result. On the other
hand, non-compactness of $\Theta$ would probably bring about technical
difficulties, and cases where the result would not hold. Stronger forms of
identifiability would probably be required in general, to avoid problems with
limits.

Finally, another line of inquiry is the results that might be expected in
a Bayesian framework. The most natural equivalent to the convergence rate
of the a posteriori distribution to the real parameter is the pointwise rate
of convergence. Hence the question: can we build Bayesian estimators where
the a posteriori distributions converge at rate $n^{-1/2}$ everywhere? Of
course, the convergence would not be uniform.

4. Key tools. 

4.1. Local asymptotic normality and Theorem 3.2. We prove Theorem 3.2
by displaying locally asymptotically normal families
$\{G_n(u), u \in \mathbb{R}\}$ with scale factor $n^{-1/(4(m-m_0)+2)}$. A
far-from-general definition, but sufficient for our purposes, of local
asymptotic normality (Le Cam, 1986) is as follows:

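Definition 4.1. Given densities $f_{n,u}$ ($n \in \mathbb{N}$,
$u \in \mathbb{R}$) with respect to some dominating measure, consider
experiments $\mathcal{E}_n = \{f_{n,u}, u \in \mathcal{U}_n\}$ where the
$\mathcal{U}_n$, $n \in \mathbb{N}$, are real sets such that each real number
is in $\mathcal{U}_n$ for $n$ large enough. Let $X$ have density $f_{n,0}$ and
consider the log-likelihood ratios:

$Z_{n,0}(u) = \log \frac{f_{n,u}(X)}{f_{n,0}(X)}.$

Suppose that there is a positive constant $\Gamma$ and a sequence of random
variables $Z_n$ with $Z_n \xrightarrow{d} \mathcal{N}(0,\Gamma)$, such that
for all $u \in \mathbb{R}$,

(12) $Z_{n,0}(u) - u Z_n + \frac{u^2}{2}\Gamma \xrightarrow[n \to \infty]{P} 0.$

The sequence of experiments $\mathcal{E}_n$ is said to be locally
asymptotically normal (LAN) and to converge to the Gaussian shift experiment
$\{\mathcal{N}(u\Gamma,\Gamma), u \in \mathbb{R}\}$.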

Note that if $X$ were exactly $\mathcal{N}(u\Gamma,\Gamma)$-distributed, the
l.h.s. of (12) would be zero, with a suitable $Z_n$ exactly
$\mathcal{N}(0,\Gamma)$-distributed. In addition, intuitively, (almost)
anything that can be done in a Gaussian shift experiment can be done
asymptotically in a LAN sequence of experiments.

Now, consider a mixing distribution
$G_0 = \sum_{j=1}^{m_0} \pi_j \delta_{\theta_j} \in \mathcal{G}_{m_0}$ with
$m_0$-th support point $\theta_{m_0}$ in the interior of $\Theta$.

Then, for $n$ big enough,
$\theta_{j,n}(u) = \theta_{m_0} + n^{-1/(4(m-m_0)+2)} h_j(u)$ is in $\Theta$.
The LAN family will be

(13) $G_n(u) = \sum_{j=1}^{m_0-1} \pi_j \delta_{\theta_j} + \pi_{m_0} \sum_{j=m_0}^{m} \pi_j(u)\, \delta_{\theta_{j,n}(u)},$

with $h_j(u)$ and $\pi_j(u)$ chosen so that the first moments around
$\theta_{m_0}$ are the same for all $u$, that is, for some relevant $\mu_k$,

$\sum_{j=m_0}^{m} \pi_j(u) h_j(u)^k = \mu_k$ for $k \le 2(m-m_0)$,

$\sum_{j=m_0}^{m} \pi_j(u) h_j(u)^{2(m-m_0)+1} = u.$

The reason why they exist is Theorem 2A by Lindsay (1989) on the matrix
of moments:

Theorem 4.2. Given numbers $1, \mu_1, \dots, \mu_{2d}$, write $M_k$ for the
$k+1$ by $k+1$ (Hankel) matrix with entries $(M_k)_{i,j} = \mu_{i+j-2}$
(with $\mu_0 = 1$) for $k \in [\![1,d]\!]$.

a. The numbers $1, \mu_1, \dots, \mu_{2d}$ are the moments of a distribution
with exactly $p$ points of support if and only if $\det M_k > 0$ for
$k \in [\![1,p-1]\!]$ and $\det M_p = 0$.

b. If the numbers $1, \mu_1, \dots, \mu_{2d-2}$ satisfy $\det M_k > 0$ for
$k \in [\![1,d-1]\!]$ and $\mu_{2d-1}$ is any scalar, then there exists a
unique distribution with exactly $d$ points of support and those initial
$2d-1$ moments.
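The Hankel criterion is easy to try out on small examples. The sketch below
is an illustration with exact rational arithmetic; the helper names
`moments`, `det` and `hankel_det`, and the zero-based indexing
`mu[i + j]`, are choices of this sketch rather than notation from Lindsay
(1989).

```python
from fractions import Fraction


def moments(atoms, order):
    """Moments mu_0, ..., mu_order of a distribution {point: weight}."""
    return [sum(w * x**p for x, w in atoms.items()) for p in range(order + 1)]


def det(matrix):
    """Exact determinant by Gaussian elimination over Fractions."""
    a = [[Fraction(v) for v in row] for row in matrix]
    size = len(a)
    result = Fraction(1)
    for col in range(size):
        pivot = next((r for r in range(col, size) if a[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            result = -result  # a row swap flips the sign
        result *= a[col][col]
        for r in range(col + 1, size):
            factor = a[r][col] / a[col][col]
            for c in range(col, size):
                a[r][c] -= factor * a[col][c]
    return result


def hankel_det(mu, k):
    """Determinant of the (k+1) x (k+1) matrix with entries mu[i + j],
    for 0 <= i, j <= k."""
    return det([[mu[i + j] for j in range(k + 1)] for i in range(k + 1)])


# A two-point distribution: det M_1 > 0 and det M_2 = 0.
two_points = {Fraction(-1): Fraction(4, 5), Fraction(4): Fraction(1, 5)}
mu = moments(two_points, 4)
dets_two_points = [hankel_det(mu, 1), hankel_det(mu, 2)]
```

For a distribution with three support points, the same computation gives a
strictly positive determinant for $M_2$, in line with part a.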


With such a family, we can prove the following theorem, whose proof we 
delay to Section 5: 


Theorem 4.3. Let
$G_0 = \sum_{j=1}^{m_0} \pi_j \delta_{\theta_j} \in \mathcal{G}_{m_0}$ be a
mixing distribution whose $m_0$-th support point is in the interior of
$\Theta$. Let $m \ge m_0$.

Then there are mixing distributions $G_n(u)$ ($n \ge 1$, $u \in \mathbb{R}$),
all in $\mathcal{G}_m$, such that

a. $W(G_n(u),G_0) \to 0$ for all $u \in \mathbb{R}$. More precisely,
$W(G_n(u),G_0) \preccurlyeq_{u} n^{-1/(4(m-m_0)+2)}$.

b. The mixing distributions $G_n(u)$ get closer at rate
$n^{-1/(4(m-m_0)+2)}$. For distinct $u$ and $u'$,
$W(G_n(u),G_n(u')) \succcurlyeq_{u,u'} n^{-1/(4(m-m_0)+2)}$.

c. If the family $\{f(\cdot,\theta), \theta \in \Theta\}$ satisfies
Assumption A with $\theta_0 = \theta_{m_0}$, then there is a number
$\Gamma > 0$, a sequence $U(n) \to \infty$ and an infinite subset
$\mathbb{N}_0$ of $\mathbb{N}$ along which the experiments

$\mathcal{E}_n = \Big\{ \bigotimes_{i=1}^{n} f(\cdot,G_n(u)),\ |u| \le U(n) \Big\}$

converge to the Gaussian shift experiment
$\{\mathcal{N}(u\Gamma,\Gamma), u \in \mathbb{R}\}$.

Remark 4.4. We want only an example of this slow convergence, and 
it should be somewhat typical. That is why we have chosen the regularity 
conditions to make the proof easy, while still being easy to check, in particular 
for exponential families. 

In particular, it would probably be possible to lower $q$ in $(p, q)$-smoothness
to $2 + \varepsilon$ and still get the uniform bound we use in the law of large numbers
below. Similarly, less derivability might be necessary if we tried to imitate
differentiability in quadratic mean.

In the opposite direction, the variance $\Gamma$ in the limit experiment is really
expected to be [...] in most cases, but more stringent regularity
conditions may be needed to prove it.


Theorem 4.3 and its proof show that when the first moments of the
components of the mixing distribution $G$ near $\theta_{m_0}$ are known, all remaining
knowledge we may acquire is on the next moment, and that is the "right"
parameter: it is exactly as hard to distinguish between, say, 10 and 11
as between 0 and 1.

On the other hand, for our original problem the cost function is the
transportation distance between mixing distributions, so an optimal
estimator in mean square error for $u$ is not optimal for our original problem.
Moreover, just taking the loss function $c(u_1, u_2)$ in the limit experiment runs
into technical problems since this might go to zero as $u_2$ goes to infinity.
They could be overcome, but it is easier to show how Theorem 4.3 entails
Theorem 3.2 using just two points and contiguity (Le Cam, 1960).


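Proof of Theorem 3.2. Fix $u > 0$ and consider the densities $f_{n,u} =
\bigotimes_{i=1}^{n} f(\cdot, G_n(u))$ with associated probability measures $P_{G_n(u)}^{\otimes n}$, which are
simply denoted by $P_{G_n(u)}$ below. We have

(14)  $\liminf_{n \to \infty,\ n \in N_0}\ \inf_{A :\ P_{G_n(0)}(A) \ge 3/4} P_{G_n(u)}(A) \ \ge\ \frac14\, e^{-u^2 \Gamma / 2}.$

Indeed, from Theorem 4.3.c and the LAN property (12), if $X$ is of density
$f_{n,0}$ and $n$ ranges over $N_0$, then

$\frac{f_{n,u}(X)}{f_{n,0}(X)} = \rho_n\, e^{u Z_n - \frac{u^2}{2} \Gamma}$  where  $\rho_n \xrightarrow{P} 1$ and $Z_n \xrightarrow{d} \mathcal N(0, \Gamma)$.

For any event $A$,

$P_{G_n(u)}(A) \ \ge\ E_{G_n(0)}\Big( \frac{f_{n,u}(X)}{f_{n,0}(X)}\, 1_A \Big) = e^{-u^2 \Gamma / 2}\, E_{G_n(0)}\big( \rho_n e^{u Z_n} 1_A \big).$

Furthermore, by restriction to the event $\{Z_n > 0\}$ and by using $\rho_n \xrightarrow{P} 1$, we
get

$E_{G_n(0)}\big( \rho_n e^{u Z_n} 1_A \big) \ \ge\ P_{G_n(0)}(A) - P_{G_n(0)}(Z_n \le 0) + o(1).$

Taking now the infimum over events $A$ such that $P_{G_n(0)}(A) \ge 3/4$ and passing
to the limit as $n \to \infty$ along $N_0$, we obtain (14).

We now consider, for any sequence of estimators $\hat G_n$, the event

$A = \big\{ n^{1/(4(m-m_0)+2)}\, W(\hat G_n, G_n(0)) \ge a \big\}$

for some $a > 0$ to choose. By Theorem 4.3.b, there is a constant $c(u, 0) > 0$
such that $n^{1/(4(m-m_0)+2)}\, W(G_n(u), G_n(0)) \ge c(u, 0)$, so that by the triangle
inequality,

$A^c \subset \big\{ n^{1/(4(m-m_0)+2)}\, W(\hat G_n, G_n(u)) \ge c(u, 0) - a \big\}.$

Choose $a = c(u, 0)/2$. Then either $P_{G_n(0)}(A) \ge 1/4$, which gives

$\sup_{G_1 \in \{G_n(0)\}} E_{G_1}\big[ n^{1/(4(m-m_0)+2)}\, W(\hat G_n, G_1) \big] \ \ge\ \frac14 \cdot \frac{c(u, 0)}{2},$

or $P_{G_n(0)}(A^c) \ge 3/4$ and, by (14), $P_{G_n(u)}(A^c) \ge e^{-u^2 \Gamma / 2}/4$ in the limit, so that

$\liminf_{n}\ \sup_{G_1 \in \{G_n(u)\}} E_{G_1}\big[ n^{1/(4(m-m_0)+2)}\, W(\hat G_n, G_1) \big] \ \ge\ \frac{c(u, 0)}{8}\, e^{-u^2 \Gamma / 2}.$

Thus, gathering the two inequalities, we get

$\liminf_{n}\ \sup_{G_1 \in \{G_n(0), G_n(u)\}} E_{G_1}\big[ n^{1/(4(m-m_0)+2)}\, W(\hat G_n, G_1) \big] \ \ge\ \frac{c(u, 0)}{8}\, e^{-u^2 \Gamma / 2}.$

Note to finish that by Theorem 4.3.a, each $G_n(0)$ or $G_n(u)$ is at Wasserstein
distance at most $n^{-1/(4(m-m_0)+2)+\varepsilon}$ from $G_0$ for $n$ large enough. Theorem 3.2
is thus established. □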

4.2. Comparison between distances and upper bounds on convergence rates. 
All our upper bounds on convergence rates come from the properties of the 
minimum distance estimator (7). Now, by the triangle inequality, if the
$n$-sample comes from $F(\cdot, G_1)$ with $G_1 \in \mathcal G_{\le m}$, then

(15)  $\|F(\cdot, \hat G_n) - F(\cdot, G_1)\|_\infty \le 2\, \|F(\cdot, G_1) - F_n\|_\infty.$

Hence, the following lemma allows us to get bounds on rates whenever we can
control (a power of) the transportation distance between mixing distributions
by the $L^\infty$-distance between mixtures.
Lemma 4.5. Let $d \ge 1$. Assume that the optimal estimators $\hat G_n$ in Theorem 3.3
satisfy for some constant $C > 0$ and on some event $A$,

$W(\hat G_n, G_1)^d \le C\, \|F_n - F(\cdot, G_1)\|_\infty.$

Then

$E_{G_1} W(\hat G_n, G_1) \le \Big( \frac{C^2 \pi}{2n} \Big)^{1/(2d)} + \mathrm{Diam}(\Theta)\, P_{G_1}(A^c).$


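Proof. By assumption, we can bound $W(\hat G_n, G_1)$ on $A$ and we can also
always bound $W(\hat G_n, G_1)$ by $\mathrm{Diam}(\Theta)$, in particular on $A^c$, so that using
Jensen's inequality,

$E_{G_1} W(\hat G_n, G_1) \le C^{1/d} \big[ E_{G_1} \|F_n - F(\cdot, G_1)\|_\infty \big]^{1/d} + \mathrm{Diam}(\Theta)\, P_{G_1}(A^c).$

Now, the Dvoretzky-Kiefer-Wolfowitz inequality (Massart, 1990) asserts that
for any $z > 0$,

(16)  $P_{G_1}\big( \|F(\cdot, G_1) - F_n\|_\infty > z \big) \le 2 e^{-2nz^2},$

and consequently,

$E_{G_1} \|F_n - F(\cdot, G_1)\|_\infty \le \int_0^\infty 2 e^{-2nz^2}\, dz = \sqrt{\frac{\pi}{2n}}.$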

The proof is complete. □ 
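
The integration step in this proof is easy to check numerically. The sketch below compares a midpoint-rule integral of the DKW tail with the closed form $\sqrt{\pi/(2n)}$ and evaluates the right-hand side of Lemma 4.5; the function names and the truncation point of the integral are choices made for this illustration.

```python
import math


def dkw_tail(z, n):
    """DKW bound 2 * exp(-2 * n * z^2) on P(sup |F_n - F| > z)."""
    return 2.0 * math.exp(-2.0 * n * z * z)


def dkw_expectation_bound(n):
    """Closed form sqrt(pi / (2n)) of the integral of the DKW tail over (0, inf)."""
    return math.sqrt(math.pi / (2.0 * n))


def integrate_tail(n, steps=20000):
    """Midpoint rule for the integral of dkw_tail over (0, 10 / sqrt(n))."""
    upper = 10.0 / math.sqrt(n)  # the tail beyond this point is below 2 * exp(-200)
    h = upper / steps
    return h * sum(dkw_tail((i + 0.5) * h, n) for i in range(steps))


def lemma_bound(C, d, n, diam, prob_complement):
    """Right-hand side of Lemma 4.5: (C^2 * pi / (2n))^(1/(2d)) + Diam * P(A^c)."""
    return (C * C * math.pi / (2.0 * n)) ** (1.0 / (2.0 * d)) + diam * prob_complement


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, integrate_tail(n), dkw_expectation_bound(n))
```

With $d = 1$ and $C = 2/\delta$ the first term of `lemma_bound` reduces to $\sqrt{2\pi}\,\delta^{-1} n^{-1/2}$, the parametric rate.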


The following theorem is the key technical tool for the proof of
Theorem 3.3. We describe the main and novel ingredient in its proof, a
coarse-graining tree, in Section 5.1.


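Theorem 4.6. Let $G_0 \in \mathcal G_{m_0}$. Under Assumption B(2m),

a. There are $\varepsilon > 0$ and $\delta > 0$ such that

$\inf_{G_1, G_2 \in \mathcal G_{\le m} \cap \mathcal W_{G_0}(\varepsilon),\ G_1 \ne G_2} \frac{\|F(\cdot, G_1) - F(\cdot, G_2)\|_\infty}{W(G_1, G_2)^{2m-2m_0+1}} > \delta.$

b. There exists $\delta > 0$ such that

$\inf_{G_1, G_2 \in \mathcal G_{\le m},\ G_1 \ne G_2} \frac{\|F(\cdot, G_1) - F(\cdot, G_2)\|_\infty}{W(G_1, G_2)^{2m-1}} > \delta.$
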
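Proof of Theorem 3.3. Let $\varepsilon > 0$ be as in Theorem 4.6.a and set

$z_\varepsilon = \sup_{G_1} \inf_{G_2} \|F(\cdot, G_1) - F(\cdot, G_2)\|_\infty$

where the supremum is taken over all $G_1 \in \mathcal G_{\le m} \cap \mathcal W_{G_0}(\varepsilon/2)$ and the infimum
is taken over all $G_2 \in \mathcal G_{\le m} \setminus \mathcal W_{G_0}(\varepsilon)$. By compactness of $\mathcal G_{\le m} \setminus \mathcal W_{G_0}(\varepsilon)$ the
infimum is attained and by identifiability (coming from Assumption B(2m)),
it is nonzero. Thus, a fortiori, we have $z_\varepsilon > 0$.

Set $A_\varepsilon = \{ \|F(\cdot, G_1) - F_n\|_\infty < z_\varepsilon / 2 \}$; on $A_\varepsilon$, by inequality (15) for the
minimum distance estimator $\hat G_n$, we see that $\|F(\cdot, G_1) - F(\cdot, \hat G_n)\|_\infty < z_\varepsilon$.
Hence, if $G_1 \in \mathcal G_{\le m} \cap \mathcal W_{G_0}(\varepsilon/2)$, then $\hat G_n$ is in $\mathcal W_{G_0}(\varepsilon)$ and we may use
Theorem 4.6.a:

$W(\hat G_n, G_1)^{2m-2m_0+1} \le \frac{1}{\delta}\, \|F(\cdot, \hat G_n) - F(\cdot, G_1)\|_\infty \le \frac{2}{\delta}\, \|F_n - F(\cdot, G_1)\|_\infty.$

Applying Lemma 4.5 with $A = A_\varepsilon$, $C = 2/\delta$ and $d = 2m - 2m_0 + 1$ together
with (16) with $z = z_\varepsilon / 2$ yields

$E_{G_1} W(\hat G_n, G_1) \lesssim_{\delta, d} n^{-1/(2d)},$

and bound (8) is proved.

Applying the same Lemma 4.5 with $d = 2m - 1$ and Theorem 4.6.b likewise
yields bound (9).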

□ 

We now give two related results under weaker derivability assumptions, 
but less general. The first is the valid weaker version of Lemma 2 by Chen 
(1995), which is sufficient for the use other authors have made of it. Here, we 
only compare mixtures in a ball with the mixture at the center of the ball. 
The second covers the case where the number of components in the mixture 
is known, and is used for the proof of Theorem 3.5. 

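Theorem 4.7. Let $G_0 \in \mathcal G_{m_0}$. Under Assumption B(2), there are $\varepsilon > 0$
and $\delta > 0$ such that

$\inf_{G_1 \in \mathcal G_{\le m} \cap \mathcal W_{G_0}(\varepsilon)} \frac{\|F(\cdot, G_1) - F(\cdot, G_0)\|_\infty}{W(G_1, G_0)^2} > \delta.$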

Proof. We can follow the proof of Chen (1995, Lemma 2) which holds 
here, because the 7 ^ defined in his paper are all non-negative, and at least 
one is nonzero. □ 
Theorem 4.8. Let $G_0 \in \mathcal G_m$. Under Assumption B(1), there are $\varepsilon > 0$
and $\delta > 0$ such that

$\inf_{G_1, G_2 \in \mathcal G_{\le m} \cap \mathcal W_{G_0}(\varepsilon),\ G_1 \ne G_2} \frac{\|F(\cdot, G_1) - F(\cdot, G_2)\|_\infty}{W(G_1, G_2)} > \delta.$

The proof is given in the supplemental part (Heinrich and Kahn, 2015).

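Proof of Theorem 3.5. Consider a fixed mixing distribution $G_0$ with
exactly $m_0$ components. Set

$\varepsilon' = \inf_{G_1 \in \mathcal G_{< m_0}} \|F(\cdot, G_1) - F(\cdot, G_0)\|_\infty,$

$\varepsilon'' = \inf_{W(G_1, G_0) \ge \varepsilon} \|F(\cdot, G_1) - F(\cdot, G_0)\|_\infty.$

By compactness and identifiability, $\varepsilon'$ and $\varepsilon''$ are attained and positive. Let
the event $A_n = \{ \|F(\cdot, G_0) - F_n\|_\infty \le z_n \}$ with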

= 1 (g/ _ „-i/2+«) ^ g// _ 

We first bound $W(\hat G_n, G_0)$ on the event $A_n$: we have, by definition (7) of the
minimum distance estimator $\hat G_{n,m_0}$ on $\mathcal G_{\le m_0}$,

\\F{-,Gn,mo) - ^n\\oo < ||F(-,Go) - F„||oo ^ ^ 

so that $\hat m$ is at most $m_0$; moreover, by the triangle inequality, for all
$G_1 \in \mathcal G_{< m_0}$,

||P(-, Gi) - Felloe ^e'-zn> 

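so that $\hat m$ is at least $m_0$ and thus $\hat G_n = \hat G_{n,m_0} \in \mathcal G_{\le m_0}$. Moreover,

$\|F(\cdot, \hat G_n) - F(\cdot, G_0)\|_\infty \le 2\, \|F(\cdot, G_0) - F_n\|_\infty \le 2 z_n < \varepsilon'',$

so that $\hat G_n$ must be in $\mathcal W_{G_0}(\varepsilon)$. Hence, by Theorem 4.8 and (15), we get:

$W(\hat G_n, G_0) \le \frac{1}{\delta}\, \|F(\cdot, \hat G_n) - F(\cdot, G_0)\|_\infty \le \frac{2}{\delta}\, \|F_n - F(\cdot, G_0)\|_\infty.$

By Lemma 4.5 for $A = A_n$, $C = 2/\delta$ and $d = 1$ and (16) for $z = z_n$, we
deduce

$E_{G_0}\big[ W(\hat G_n, G_0) \big] \le \frac{\sqrt{2\pi}}{\delta}\, n^{-1/2} + 2\, \mathrm{Diam}(\Theta)\, e^{-2 n z_n^2}$

and we are done. □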
5. Proofs.

5.1. The coarse-graining tree and Theorem 4.6. Theorem 4.6.b is a consequence
of Theorem 4.6.a, compactness and identifiability. Details are given in
the supplemental part (Heinrich and Kahn, 2015).

We split the proof of Theorem 4.6.a into three steps. 


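Step 1: selecting $(G_{1,n}, G_{2,n})$ and related scaling sequences. We have to
prove

$\lim_{\varepsilon \to 0}\ \inf_{G_1, G_2 \in \mathcal G_{\le m} \cap \mathcal W_{G_0}(\varepsilon),\ G_1 \ne G_2} \frac{\|F(\cdot, G_1) - F(\cdot, G_2)\|_\infty}{W(G_1, G_2)^{2m-2m_0+1}} > \delta,$

for some $\delta > 0$. Choose for each $n$ distinct mixing distributions $G_{1,n}, G_{2,n}$ in
$\mathcal G_{\le m} \cap \mathcal W_{G_0}(\tfrac1n)$ such that, setting $\Delta G_n = G_{1,n} - G_{2,n}$,

$\inf_{G_1, G_2 \in \mathcal G_{\le m} \cap \mathcal W_{G_0}(\frac1n),\ G_1 \ne G_2} \frac{\|F(\cdot, G_1) - F(\cdot, G_2)\|_\infty}{W(G_1, G_2)^{2m-2m_0+1}} + \frac1n \ \ge\ \frac{\|F(\cdot, \Delta G_n)\|_\infty}{W(\Delta G_n)^{2m-2m_0+1}}.$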

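We may and do assume that $(G_{1,n}) \subset \mathcal G_{m_1}$ and $(G_{2,n}) \subset \mathcal G_{m_2}$ for some $m_1, m_2$
at most $m$. We can then write

$G_{1,n} = \sum_{j=1}^{m_1} \pi_{1,j,n}\, \delta_{\theta_{1,j,n}}$  and  $G_{2,n} = \sum_{j=m_1+1}^{m_1+m_2} \pi_{2,j,n}\, \delta_{\theta_{2,j,n}},$

and thus the signed measure $\Delta G_n$ is:

$\Delta G_n = \sum_{j=1}^{m_1+m_2} \pi_{j,n}\, \delta_{\theta_{j,n}}$

with

$(\pi_{j,n}, \theta_{j,n}) = (\pi_{1,j,n}, \theta_{1,j,n})$ for $j \in [\![1, m_1]\!]$,
$(\pi_{j,n}, \theta_{j,n}) = (-\pi_{2,j,n}, \theta_{2,j,n})$ for $j \in [\![m_1 + 1, m_1 + m_2]\!]$.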

Up to selecting a subsequence of $\Delta G_n$, we may find a finite number of scaling
sequences $\varepsilon_s(n)$, $s \in [\![0, S]\!]$, such that

(17)  $0 = \varepsilon_0(n) < \varepsilon_1(n) < \cdots < \varepsilon_S(n) = 1$  with  $\varepsilon_s(n) = o(\varepsilon_{s+1}(n)),$

and such that they are of the same order as the rates of convergence of
the various $|\theta_{j,n} - \theta_{j',n}|$ for $j, j' \in [\![1, m_1 + m_2]\!]$ and $|\sum_{j \in J} \pi_{j,n}|$ for $J \subset
[\![1, m_1 + m_2]\!]$. That is, there are integers $s(j, j')$ and $s_\pi(J)$ in $[\![0, S]\!]$ such that

(18)  $|\theta_{j,n} - \theta_{j',n}| \asymp \varepsilon_{s(j,j')}(n)$  and  $\Big| \sum_{j \in J} \pi_{j,n} \Big| \asymp \varepsilon_{s_\pi(J)}(n).$

Note that the map $s(\cdot, \cdot)$ defined by (18) is an ultrametric on $[\![1, m_1 + m_2]\!]$
(but does not separate points). We also define the $s$-diameter of subsets $J$
of $[\![1, m_1 + m_2]\!]$ by

$s(J) = \max_{j, j' \in J} s(j, j').$


Step 2: construction of the coarse-graining tree and key lemmas. Consider
the collection $\mathcal T$ of distinct ultrametric balls $J = \{j' : s(j, j') \le s\}$ that we
can make when $j$ ranges over $[\![1, m_1 + m_2]\!]$ and $s$ over $[\![0, S]\!]$. This collection
defines the coarse-graining tree we need. Its root is $J_0 = [\![1, m_1 + m_2]\!]$ and
its nodes $J$ satisfy

$J \cap J' \ne \emptyset \ \Rightarrow\ J \subset J' \text{ or } J' \subset J,$

by the ultrametric property. 
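
A small sketch may help to visualise this construction. It builds the collection of distinct balls from a finite ultrametric given as a matrix, checks the nested-or-disjoint property, and computes parents. The matrix `s` in the example is a made-up ultrametric on four indices, two of which are not separated; the function names are ours.

```python
def ultrametric_balls(s):
    """All distinct balls {k : s[j][k] <= r} of a finite ultrametric given as a matrix."""
    n = len(s)
    radii = sorted({s[i][j] for i in range(n) for j in range(n)})
    balls = set()
    for j in range(n):
        for r in radii:
            balls.add(frozenset(k for k in range(n) if s[j][k] <= r))
    return balls


def is_laminar(balls):
    """True when any two balls are either disjoint or nested."""
    return all(not (a & b) or a <= b or b <= a for a in balls for b in balls)


def parent(node, balls):
    """Smallest ball strictly containing `node`, or None when `node` is the root."""
    bigger = [b for b in balls if node < b]
    return min(bigger, key=len) if bigger else None


if __name__ == "__main__":
    # Indices 0 and 1 are at s-distance 0, index 2 joins them at level 1, index 3 at level 2.
    s = [[0, 0, 1, 2],
         [0, 0, 1, 2],
         [1, 1, 0, 2],
         [2, 2, 2, 0]]
    tree = ultrametric_balls(s)
    for node in sorted(tree, key=len):
        print(sorted(node), "-> parent", parent(node, tree) and sorted(parent(node, tree)))
```

In this example the end $\{0, 1\}$ is not a singleton, which is exactly the situation where the map $s$ does not separate points.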

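A partial representation shows what the tree $\mathcal T$ looks like:

[Figure: partial representation of the coarse-graining tree $\mathcal T$, with nodes ordered by $s$-diameter.]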

Note that the ends are not necessarily singletons since the metric s(-, •) 
does not separate points. 
We define the parent $J^\uparrow$ of a node $J \subsetneq J_0$ by

$(J \subsetneq I \subset J^\uparrow,\ I \in \mathcal T) \ \Rightarrow\ I = J^\uparrow.$

The set of descendants and the set of children of a node $J$ are

$\mathrm{Desc}(J) = \{I \in \mathcal T : I \subsetneq J\},$

$\mathrm{Child}(J) = \{I \in \mathcal T : I^\uparrow = J\}.$

The following two lemmas are proved in the supplemental part (Heinrich and
Kahn, 2015, Section 3).

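Lemma 5.1. With the above notations,

(19)  $W(\Delta G_n) \asymp \max_{J \in \mathrm{Desc}(J_0)} \varepsilon_{s_\pi(J)}(n)\, \varepsilon_{s(J^\uparrow)}(n).$

Set now for $J \subset J_0$,

$F(x, J) = \sum_{j \in J} \pi_{j,n} F(x, \theta_{j,n})$

so that, in particular, $F(x, \Delta G_n) = F(x, J_0)$. We shall use Taylor expansions
along the tree $\mathcal T$ to express the order of $F(x, \Delta G_n)$ in terms of the scaling
functions $\varepsilon_s(n)$.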

Lemma 5.2. Let $J$ be a node and set $d_J = \mathrm{card}(J)$. Pick $\theta_J$ in the set
$\{\theta_{j,n} : j \in J\}$. The dependence on $n$ is skipped from the following notations.
There are a vector $a_J = (a_J(k))_{0 \le k \le 2m}$ and a remainder $R(x, J)$ such that

(20)  $F(x, J) = \sum_{k=0}^{2m} a_J(k)\, \varepsilon_{s(J)}^k\, F^{(k)}(x, \theta_J) + R(x, J),$

where:

(a) $a_J(0) = \sum_{j \in J} \pi_j$ and $|a_J(k)| \lesssim 1$ for all $k \le 2m$,

(b) There is a coefficient $a_J(k)$ of maximal order among the $d_J$ first ones.
That is, there is an integer $k_J < d_J$ such that

$\|a_J\| = \max_{k \le 2m} |a_J(k)| \asymp |a_J(k_J)|,$

(c) The norm $\|a_J\|$ is bounded from below (up to a constant) by a quantity
linked to the Wasserstein distance:

$\|a_J\| \gtrsim \max\Big\{ \varepsilon_{s_\pi(J)},\ \max_{I \in \mathrm{Desc}(J)} \varepsilon_{s_\pi(I)} \Big( \frac{\varepsilon_{s(I^\uparrow)}}{\varepsilon_{s(J)}} \Big)^{d_J - 1} \Big\},$

(d) The remainder term is negligible. Uniformly in $x$:

$R(x, J) = o\big( \|a_J\|\, \varepsilon_{s(J)}^{2m} \big),$

(e) For distinct $I, I' \in \mathrm{Child}(J)$, we have $|\theta_I - \theta_{I'}| \asymp \varepsilon_{s(J)}$.


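Step 3: concluding the proof of Theorem 4.6.a. Consider the root $J_0$ of
the tree $\mathcal T$ and distinguish two cases:

Case 1: $s(J_0) < S$. Set for short $J = J_0$. In this case we have $\varepsilon_{s(J)} = o(1)$
and may apply directly Lemma 5.2 to $J$:

$F(x, \Delta G_n) = F(x, J) = \sum_{k=0}^{2m} a_J(k)\, \varepsilon_{s(J)}^k\, F^{(k)}(x, \theta_J) + R(x, J),$

so that

(21)  $\|F(\cdot, \Delta G_n)\|_\infty \ \ge\ \Big\| \sum_{k=0}^{2m} a_J(k)\, \varepsilon_{s(J)}^k\, F^{(k)}(\cdot, \theta_J) \Big\|_\infty - \|R(\cdot, J)\|_\infty.$

By using the lower bound (6), we get

(22)  $\Big\| \sum_{k=0}^{2m} a_J(k)\, \varepsilon_{s(J)}^k\, F^{(k)}(\cdot, \theta_J) \Big\|_\infty \gtrsim \max_{k \le 2m} |a_J(k)|\, \varepsilon_{s(J)}^k \ \ge\ |a_J(k)|\, \varepsilon_{s(J)}^k$  for all $k < d_J$,

and, taking $k = k_J$, (b) and (c) yield

$|a_J(k_J)|\, \varepsilon_{s(J)}^{k_J} \gtrsim \|a_J\|\, \varepsilon_{s(J)}^{d_J - 1} \gtrsim \max_{I \in \mathrm{Desc}(J)} \varepsilon_{s_\pi(I)}\, \varepsilon_{s(I^\uparrow)}^{d_J - 1};$

going on, since $d_J \le m_1 + m_2 \le 2m$, we get

(23)  $|a_J(k_J)|\, \varepsilon_{s(J)}^{k_J} \gtrsim \max_{I \in \mathrm{Desc}(J)} \varepsilon_{s_\pi(I)}\, \varepsilon_{s(I^\uparrow)}^{2m-1}.$

Since $R(x, J)$ is of smaller order by (d), we get from (21), (22), (23)

$\|F(\cdot, \Delta G_n)\|_\infty \gtrsim \max_{I \in \mathrm{Desc}(J)} \varepsilon_{s_\pi(I)}\, \varepsilon_{s(I^\uparrow)}^{2m-1}$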



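so that Lemma 5.1 gives

$\|F(\cdot, \Delta G_n)\|_\infty \gtrsim W(\Delta G_n)^{2m-1}.$

Case 2: $s(J_0) = S$. We split $\Delta G_n$ over the first-generation children:

$F(x, \Delta G_n) = F(x, J_0) = \sum_{J \in \mathrm{Child}(J_0)} F(x, J) = \sum_{J \in \mathrm{Child}(J_0)} \Big[ \sum_{k=0}^{2m} a_J(k)\, \varepsilon_{s(J)}^k\, F^{(k)}(x, \theta_J) + R(x, J) \Big].$

Moreover, by (e), the $\theta_J$ for $J \in \mathrm{Child}(J_0)$ are $\varepsilon$-separated for some $\varepsilon > 0$
so that the lower bound (6) can be applied in the bracket above and yields,
since the $R(x, J)$'s are negligible:

$\|F(\cdot, \Delta G_n)\|_\infty \gtrsim \max_{J \in \mathrm{Child}(J_0)} \max_{k \le 2m} |a_J(k)|\, \varepsilon_{s(J)}^k \ \ge\ \max_{J \in \mathrm{Child}(J_0)} \max_{k < d_J} |a_J(k)|\, \varepsilon_{s(J)}^k.$

If we take $k = 0$ rather than the maximum for $k < d_J$ in the last bound
above, we deduce from (a) that

$\|F(\cdot, \Delta G_n)\|_\infty \gtrsim \max_{J \in \mathrm{Child}(J_0)} \varepsilon_{s_\pi(J)},$

whereas if we substitute $k_J$ to $k$ we get from (b) and next (c)

$\|F(\cdot, \Delta G_n)\|_\infty \gtrsim \max_{J \in \mathrm{Child}(J_0)} \|a_J\|\, \varepsilon_{s(J)}^{d_J - 1} \gtrsim \max_{J \in \mathrm{Child}(J_0)} \max_{I \in \mathrm{Desc}(J)} \varepsilon_{s_\pi(I)}\, \varepsilon_{s(I^\uparrow)}^{d_J - 1}.$

We may combine the two lower bounds above and, after recalling that
$\varepsilon_{s(J_0)} = 1$ and setting $d^* = \max_{J \in \mathrm{Child}(J_0)} d_J$, get

$\|F(\cdot, \Delta G_n)\|_\infty \gtrsim \max_{J \in \mathrm{Child}(J_0)} \max_{I \in \mathrm{Desc}(J) \cup \{J\}} \varepsilon_{s_\pi(I)}\, \varepsilon_{s(I^\uparrow)}^{d_J - 1} \ \ge\ \Big( \max_{J \in \mathrm{Desc}(J_0)} \varepsilon_{s_\pi(J)}\, \varepsilon_{s(J^\uparrow)} \Big)^{d^* - 1} \gtrsim W(\Delta G_n)^{d^* - 1},$

where the last inequality comes from Lemma 5.1. It remains to estimate
$d^*$. Since $G_{1,n}$ and $G_{2,n}$ converge to $G_0 \in \mathcal G_{m_0}$, the root $J_0$ (of cardinality
$m_1 + m_2$) has at least $m_0$ children with at least two elements. Thus, the
cardinality $d^*$ of the biggest child is bounded by $m_1 + m_2 - 2(m_0 - 1)$ so
that

$d^* - 1 \le 2m - 2m_0 + 1.$

Finally, if $m_0$ is more than one, we are in Case 2 where $s(J_0) = S$ and if
$m_0$ is one, Case 1 and Case 2 can occur. But whatever the case, we always
have

$\|F(\cdot, \Delta G_n)\|_\infty \gtrsim W(\Delta G_n)^{2m-2m_0+1}$

so that Theorem 4.6.a. is proved.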

5.2. Proof of Theorem 4.3. Set $d = m - m_0 + 1$ for short. Consider
numbers $\mu_0 = 1, \mu_1, \ldots, \mu_{2d-2}$ such that the Hankel matrices $(M_k)_{i,j} =
\mu_{i+j-2}$ satisfy $\det M_k > 0$ for $k \in [\![1, d-1]\!]$. By Theorem 4.2.b., we may
then define for any real number $u$ a distribution $G(u) = \sum_{j=m_0}^{m} \pi_j(u)\, \delta_{h_j(u)}$
with initial moments $1, \mu_1, \ldots, \mu_{2d-2}$, $\mu_{2d-1} = u$. Moreover, the uniqueness in
Theorem 4.2.b. implies that, on the set

$\big\{ (\pi_1, \ldots, \pi_d, h_1, \ldots, h_d) \in \mathbb R^{2d} : \pi_1 > 0, \ldots, \pi_d > 0,\ h_1 < \cdots < h_d \big\},$

the following map is injective:

$\phi : (\pi_1, \ldots, \pi_d, h_1, \ldots, h_d) \mapsto \Big( \sum_{j=1}^{d} \pi_j,\ \sum_{j=1}^{d} \pi_j h_j,\ \ldots,\ \sum_{j=1}^{d} \pi_j h_j^{2d-1} \Big).$

Now, its Jacobian is non-zero (see Heinrich and Kahn, 2015, Section E):

(24)  $J(\phi) = (-1)^{\frac{(d-1)d}{2}}\, \pi_1 \cdots \pi_d \prod_{1 \le j < k \le d} (h_j - h_k)^4.$

Thus the inverse of $\phi$ is locally continuous, so that, in particular, the $h_j(u)$
are all continuous. Thus, we can set $H(U) = \max_j \max_{|u| \le U} |h_j(u)|$, which
is finite for any $U > 0$, and we choose a positive sequence $U(n)$ such that
$U(n) \to \infty$ and $H(U(n))\, n^{-1/(4d-2)} \to 0$.
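
For $d = 2$ (that is, $m = m_0 + 1$) the distribution $G(u)$ can be written in closed form, which gives a concrete picture of the family. The sketch below is only an illustration for this special case: it solves for the two-point law with prescribed first two moments and third moment $u$ by standardising and solving $t^2 - \gamma t - 1 = 0$, where $\gamma$ is the standardised third moment. The function names are ours, not the paper's.

```python
import math


def two_point_from_moments(mu1, mu2, u):
    """Weights and support points of the two-point law with moments (mu1, mu2, u)."""
    variance = mu2 - mu1 ** 2  # equals det M_1 and must be positive
    if variance <= 0.0:
        raise ValueError("need det M_1 = mu2 - mu1^2 > 0")
    sd = math.sqrt(variance)
    skew = (u - 3.0 * mu1 * mu2 + 2.0 * mu1 ** 3) / sd ** 3  # standardised third moment
    root = math.sqrt(skew * skew + 4.0)
    t_low, t_high = (skew - root) / 2.0, (skew + root) / 2.0  # roots of t^2 - skew*t - 1
    weights = (t_high / (t_high - t_low), -t_low / (t_high - t_low))
    points = (mu1 + sd * t_low, mu1 + sd * t_high)
    return weights, points


def local_support_points(theta_m0, n, points):
    """theta_{j,n}(u) = theta_{m0} + n^(-1/(4d-2)) * h_j(u), with d = len(points)."""
    d = len(points)
    scale = n ** (-1.0 / (4 * d - 2))
    return tuple(theta_m0 + scale * h for h in points)


def raw_moments(weights, points, order):
    """Moments of order 0, ..., order of the discrete law."""
    return [sum(w * h ** k for w, h in zip(weights, points)) for k in range(order + 1)]


if __name__ == "__main__":
    for u in (-1.0, 0.0, 1.0):
        w, h = two_point_from_moments(0.0, 1.0, u)
        print(u, w, h, raw_moments(w, h, 3))
```

The support points $h_j(u) = (u \pm \sqrt{u^2 + 4})/2$ obtained for $\mu_1 = 0$, $\mu_2 = 1$ are visibly continuous in $u$, in line with the continuity argument above.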

We now define support points $\theta_{j,n}(u) = \theta_{m_0} + n^{-1/(4d-2)} h_j(u)$ in $\Theta$ and
mixing distributions around $G_0$ by

(25)  $G_n(u) = \sum_{j=1}^{m_0-1} \pi_j \delta_{\theta_j} + \pi_{m_0} \sum_{j=m_0}^{m} \pi_j(u)\, \delta_{\theta_{j,n}(u)}.$

Note that $G_n(0)$ and $G_0$ do not coincide. The form of $G_n(u)$ makes it clear
that it converges to $G_0$ at speed $n^{-1/(4d-2)}$: it is easily seen that for $|u| \le U$,

$W(G_n(u), G_0) \le \pi_{m_0} H(U)\, n^{-1/(4d-2)}.$

This proves (a).

Moreover, since all other points and proportions are equal, the transportation
distance $W(G_n(u), G_n(u'))$ is equal to the transportation distance
between the last $d$ components. Since those support points keep the same
weights and are homothetic with scale $n^{-1/(4d-2)}$ around $\theta_{m_0}$, we have
exactly

$W(G_n(u), G_n(u')) = W(G_1(u), G_1(u'))\, n^{-1/(4d-2)}.$

This proves (b).
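
The homothety argument can be illustrated with a direct computation of the transportation distance on the real line, where $W$ between two discrete laws equals the integral of the absolute difference of their distribution functions. The sketch below is a generic illustration; the weights, offsets, centre and scale in the example are arbitrary.

```python
def wasserstein1(weights_a, points_a, weights_b, points_b):
    """W_1 distance between two discrete laws on the real line."""
    # Signed jumps of F_a - F_b, sorted by location.
    jumps = sorted([(x, w) for w, x in zip(weights_a, points_a)]
                   + [(x, -w) for w, x in zip(weights_b, points_b)])
    total, gap, previous = 0.0, 0.0, jumps[0][0]
    for x, signed_weight in jumps:
        total += abs(gap) * (x - previous)  # |F_a - F_b| is constant between jumps
        gap += signed_weight
        previous = x
    return total


def shrink_around(center, scale, offsets):
    """Support points center + scale * h: a homothety of ratio `scale` around `center`."""
    return [center + scale * h for h in offsets]


if __name__ == "__main__":
    wa, ha = [0.5, 0.5], [-1.0, 1.0]
    wb, hb = [0.25, 0.75], [-2.0, 0.5]
    base = wasserstein1(wa, ha, wb, hb)
    scaled = wasserstein1(wa, shrink_around(3.0, 0.1, ha), wb, shrink_around(3.0, 0.1, hb))
    print(base, scaled)  # the second value is the first multiplied by the scale
```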

We now prove local asymptotic normality. As before, the probability under
the mixing distribution $G_n(0)$ is denoted by $P_{G_n(0)}$ and the corresponding
expectation by $E_{G_n(0)}$. Let $X_{1,n}, \ldots, X_{n,n}$ be an i.i.d. sample with density
$f(\cdot, G_n(0))$. Then, we can write the log-likelihood ratio as

$Z_{n,0}(u) = \mathrm{Log}\Big( \frac{\prod_{i=1}^{n} f(X_{i,n}, G_n(u))}{\prod_{i=1}^{n} f(X_{i,n}, G_n(0))} \Big) = \sum_{i=1}^{n} \mathrm{Log}\big( 1 + Y_{i,n}(u) \big)$

with

(26)  $Y_{i,n}(u) = \frac{f(X_{i,n}, G_n(u)) - f(X_{i,n}, G_n(0))}{f(X_{i,n}, G_n(0))}.$

Set also

(27)  $Z_{i,n} = \frac{f^{(2d-1)}(X_{i,n}, \theta_{m_0})}{(2d-1)!\, f(X_{i,n}, G_n(0))}.$

Using Taylor expansions with remainder, we find that $Y_{i,n}(u)$ and $Z_{i,n}$ are
centered under $P_{G_n(0)}$ (see Heinrich and Kahn, 2015).

Consider now

(28)  $Z_n = \pi_{m_0}\, n^{-1/2} \sum_{i=1}^{n} Z_{i,n}.$

By Proposition D.1 in the supplemental part (Heinrich and Kahn, 2015), for
$n$ large enough, we have $E_{G_n(0)} |Z_{i,n}|^2 \asymp 1$. Up to taking a subsequence, we
may then assume $E_{G_n(0)} |Z_{i,n}|^2 \to \sigma^2$ for some positive $\sigma$. By Proposition D.1
again, we have $E_{G_n(0)} |Z_{i,n}|^3 \lesssim 1$ for all $n$ large enough.

We may then apply Lyapunov theorem (Billingsley, 1995, Theorem 23.7)
to prove that

(29)  $Z_n \xrightarrow{d} \mathcal N(0, \Gamma)$

with $\Gamma = \sigma^2 \pi_{m_0}^2$.
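Indeed, the Lyapunov condition holds:

$\frac{\sum_{i=1}^{n} E_{G_n(0)} |Z_{i,n}|^3}{\big[ \sum_{i=1}^{n} E_{G_n(0)} |Z_{i,n}|^2 \big]^{3/2}} \xrightarrow[n \to \infty]{} 0,$

so that

$\frac{\sum_{i=1}^{n} Z_{i,n}}{\sqrt{\sum_{i=1}^{n} E_{G_n(0)} |Z_{i,n}|^2}} = \frac{\sum_{i=1}^{n} Z_{i,n}}{\sqrt{n\, E_{G_n(0)} |Z_{1,n}|^2}} \xrightarrow{d} \mathcal N(0, 1)$

and (29) follows easily from (28).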

Now, to get the convergence in probability of $Z_{n,0}(u) - u Z_n + \frac{u^2}{2} \Gamma$ to zero,
we show in the supplemental part (Heinrich and Kahn, 2015) the following
convergences for all $u$:

(30)  $A_n(u) = \sum_{i=1}^{n} Y_{i,n}(u) - u Z_n \to 0,$

(31)  $B_n(u) = u^2 \Gamma - \sum_{i=1}^{n} Y_{i,n}(u)^2 \to 0,$

(32)  $C_n(u) = \sum_{i=1}^{n} |Y_{i,n}(u)|^3 \to 0.$

Then, setting

$D_n(u) = Z_{n,0}(u) - \sum_{i=1}^{n} Y_{i,n}(u) + \frac12 \sum_{i=1}^{n} Y_{i,n}(u)^2,$

we have, since $|\mathrm{Log}(1 + y) - y + y^2/2| \le |y|^3$ for $|y| \le 2/3$,

$|D_n(u)| \le C_n(u)$

with probability going to one, so that

$Z_{n,0}(u) - u Z_n + \frac{u^2}{2} \Gamma = A_n(u) + \frac12 B_n(u) + D_n(u)$

tends to 0 in probability.
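
The elementary inequality on the logarithm used above can be checked on a grid. This is a numerical sanity check only, not a proof; the grid size and the small rounding slack are choices made for this sketch.

```python
import math


def log_remainder(y):
    """Log(1 + y) - y + y^2 / 2, the error of the second-order expansion."""
    return math.log1p(y) - y + 0.5 * y * y


def cubic_bound_holds(radius=2.0 / 3.0, grid_size=20001):
    """Check |log_remainder(y)| <= |y|^3 on an even grid of [-radius, radius]."""
    for i in range(grid_size):
        y = -radius + 2.0 * radius * i / (grid_size - 1)
        if abs(log_remainder(y)) > abs(y) ** 3 + 1e-15:  # slack for rounding near 0
            return False
    return True


if __name__ == "__main__":
    print(cubic_bound_holds())             # True on [-2/3, 2/3]
    print(cubic_bound_holds(radius=0.95))  # False: the bound fails close to y = -1
```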
SUPPLEMENTARY MATERIAL

Auxiliary results and technical details 

(doi: 10.1214/00-AOASXXXXSUPP; .pdf). This supplemental part gathers 

some proof details on some assertions given in the paper. 
APPENDIX A: AUXILIARY MATRIX TOOL

Lemma A.1. Let $j$, $d_i$ and $d$ be positive integers such that $\sum_{i=1}^{j} d_i = d$.
Consider numbers $\theta_1, \ldots, \theta_j$ all distinct. Write

$\mathcal I = \{ (i, \ell) \in \mathbb N^2 : 1 \le i \le j,\ 1 \le \ell \le d_i \}.$

Define for each $(i, \ell) \in \mathcal I$ a $d$-dimensional column vector as follows:

$a_{i,\ell}[k] = \frac{\theta_i^{k-\ell}}{(k-\ell)!}\, 1_{\{k \ge \ell\}}, \qquad 1 \le k \le d,$

and stack these vectors in a $d \times d$ matrix

(33)  $A(\theta_1, \ldots, \theta_j) = \big[\, a_{1,1} \,|\, \ldots \,|\, a_{1,d_1} \,|\, \cdots \,|\, a_{j,1} \,|\, \ldots \,|\, a_{j,d_j} \,\big].$

Then, the rank of $A(\theta_1, \ldots, \theta_j)$ is $d$.


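Proof. Set for short $A = A(\theta_1, \ldots, \theta_j)$. Let $\Lambda = (\lambda_{i,\ell})_{(i,\ell) \in \mathcal I}$ be a vector
such that $A \Lambda = 0$. Proving the lemma is equivalent to proving that $\Lambda = 0$.
Note that for each $k$

$(A \Lambda)_k = \sum_{(i,\ell) \in \mathcal I} \lambda_{i,\ell}\, a_{i,\ell}[k] = \sum_{(i,\ell) \in \mathcal I} \lambda_{i,\ell}\, \frac{\theta_i^{k-\ell}}{(k-\ell)!}\, 1_{\{k \ge \ell\}}$

so that for any $(d-1)$-degree polynomial $P(x) = \sum_{k=0}^{d-1} c_k \frac{x^k}{k!}$, we have

(34)  $(c_0, \ldots, c_{d-1})\, A \Lambda = \sum_{k=0}^{d-1} c_k (A \Lambda)_{k+1} = \sum_{(i,\ell) \in \mathcal I} \lambda_{i,\ell}\, P^{(\ell-1)}(\theta_i) = 0.$

Set $P_i(x) = \prod_{k \ne i} (x - \theta_k)^{d_k}$ for each $i \in [\![1, j]\!]$. Choosing successively in (34)
the following polynomials

$P(x) = (x - \theta_i)^{d_i - 1} P_i(x),$

$P(x) = (x - \theta_i)^{d_i - 2} P_i(x),$

$\ldots$

$P(x) = (x - \theta_i)^{0} P_i(x),$

yields successively $\lambda_{i,d_i} = 0$, $\lambda_{i,d_i - 1} = 0$, ..., $\lambda_{i,1} = 0$ and we are done. □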

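Corollary A.2. Let $\varepsilon > 0$ and define the set of $\varepsilon$-separated vectors in
$\Theta^j$ by

$\mathcal V_\varepsilon = \{ (\theta_i)_{1 \le i \le j} \in \Theta^j : \forall i \ne i',\ |\theta_i - \theta_{i'}| \ge \varepsilon \}.$

For any vector $\Lambda \in \mathbb R^d$ and any vector $(\theta_i)_{1 \le i \le j} \in \mathcal V_\varepsilon$,

$\|A(\theta_1, \ldots, \theta_j) \Lambda\| \asymp \|\Lambda\|,$

where $A(\theta_1, \ldots, \theta_j)$ is as in (33).

Proof. Note that the norm $\|A(\theta_1, \ldots, \theta_j) \Lambda\|$ is a continuous function of
$((\theta_1, \ldots, \theta_j), \Lambda)$ on the compact space $\mathcal V_\varepsilon \times S(0, 1)$ where $S(0, 1)$ is the
$d$-dimensional unit sphere. Its infimum and supremum are attained on $\mathcal V_\varepsilon \times
S(0, 1)$, say at $((\theta_{*i})_i, \Lambda_*)$ and $((\theta^*_i)_i, \Lambda^*)$. Now, by Lemma A.1,
$c_*(\varepsilon) = \|A(\theta_{*1}, \ldots, \theta_{*j}) \Lambda_*\|$ and $c^*(\varepsilon) = \|A(\theta^*_1, \ldots, \theta^*_j) \Lambda^*\|$ are positive so
that $c_* \|\Lambda\| \le \|A(\theta_1, \ldots, \theta_j) \Lambda\| \le c^* \|\Lambda\|$ for every $\Lambda$ and every $(\theta_i)_{1 \le i \le j}$ in
$\mathcal V_\varepsilon$. □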
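
Lemma A.1 can be checked on small examples in exact arithmetic. The sketch below assumes the entries $a_{i,\ell}[k] = \theta_i^{k-\ell}/(k-\ell)!$ for $k \ge \ell$ and $0$ otherwise, builds the matrix column by column, and computes its rank by Gaussian elimination; the function names and the test points are ours.

```python
from fractions import Fraction
from math import factorial


def confluent_matrix(thetas, multiplicities):
    """d x d matrix whose columns are a_{i,l}, with a_{i,l}[k] = theta_i^(k-l) / (k-l)!."""
    d = sum(multiplicities)
    columns = []
    for theta, d_i in zip(thetas, multiplicities):
        for l in range(1, d_i + 1):
            columns.append([Fraction(theta) ** (k - l) / factorial(k - l) if k >= l
                            else Fraction(0) for k in range(1, d + 1)])
    return [[col[r] for col in columns] for r in range(d)]


def rank(matrix):
    """Rank by Gaussian elimination over the rationals."""
    a = [row[:] for row in matrix]
    rows, cols, r = len(a), len(a[0]), 0
    for c in range(cols):
        pivot = next((i for i in range(r, rows) if a[i][c] != 0), None)
        if pivot is None:
            continue
        a[r], a[pivot] = a[pivot], a[r]
        for i in range(r + 1, rows):
            factor = a[i][c] / a[r][c]
            a[i] = [x - factor * y for x, y in zip(a[i], a[r])]
        r += 1
    return r


if __name__ == "__main__":
    # Distinct points with multiplicities (2, 1, 3): full rank d = 6.
    print(rank(confluent_matrix([0, 1, -2], [2, 1, 3])))
    # Repeated point: two identical columns, so the rank drops.
    print(rank(confluent_matrix([1, 1], [1, 1])))
```

The second call shows why the points must be distinct: the lemma fails as soon as two $\theta_i$ coincide.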


APPENDIX B: WASSERSTEIN DISTANCE AND MIXTURE ON THE
TREE $\mathcal T$

B.1. Key lemmas 5.1 and 5.2. Set for any function $f$ on $\Theta$ and any
$J \subset J_0$

(35)  $f(J) = \sum_{j \in J} \pi_{j,n} f(\theta_{j,n}).$

In particular for $f(\cdot) = F(x, \cdot)$, we have

$F(x, J) = \sum_{j \in J} \pi_{j,n} F(x, \theta_{j,n}),$

so that $F(x, \Delta G_n) = F(x, J_0)$. Set also for short

(36)  $\pi(J) = \sum_{j \in J} \pi_j.$

Proof of Lemma 5.1. With the above notations, we have to show (19).

In what follows $n$ is fixed and thus skipped in the $\theta_j$'s, $\pi_j$'s and $\varepsilon_s$'s.
For each distinct $J$, we pick an arbitrary $j \in J$ and set $\theta_J = \theta_j$. Let $f$ be
1-Lipschitz on $\Theta$. We first prove by induction that for any node $J$ of the
tree,


(37) 


/(J) ^ 7r(J)/(0j) + 


max 
7€Desc( J) 




If J has s-diameter zero, then /(J) = 7r(J)/(0j) and (37) is satisfied. Next, 
if J has children Ji that satisfy (37), we compute 


f{J) = E 

JieChiid{j) 

^ E 

JieChiid(j) 


vr(Ji)/(0Ji) + max 

/GDesc( Ji) 


^ vr(J)/(0j)+ |7r(Ji)| J/(0ji)-/(0j)|+^max^ 


JieChild(J) 
Since $|\pi(J_1)|$ is of order $\varepsilon_{s_\pi(J_1)}$ and $|\theta_{J_1} - \theta_J|$ is of order $\varepsilon_{s(J_1^\uparrow)}$, we see that
(37) holds for $J$ and in particular for $J_0$ where $\pi(J_0) = 0$.

To prove the reverse inequality, let J ^ Jo such that maxi¬ 

mal. Set 


e( J) = min \9j — 9j\ and c)(J) = max \9j — 9j\ 
iiJ j&J 

so that e(J) ^ h(J). Consider the following 1-Lipschitz function / on 0 
f{9) = -sgn(7r(J)) X min{e(J) - c)(J), [\9 - 9j\ - t)(J)]+} 

so that 

/(J) = 0 and f{Jo) = f{Jo\J) = HJ)MJ)-d{J)]. 

Since |7r(J)| is of order £s^{j) and e(J) is at least of order £s(jt) and t)(J) is 
of order £s(j), we deduce 

f{Jo) > max 

It remains to note that W (AGn) = sup||j||j^.p^i f{Jo)- 

Proof of Lemma 5.2. We shall use Taylor expansions along the tree $\mathcal T$ to
express the order of $F(x, \Delta G_n)$ in terms of the scaling functions $\varepsilon_s(n)$. Recall
Assumption B(k) in the main paper: the densities family $\{f(\cdot, \theta), \theta \in \Theta\}$
satisfies, with $F(x, \theta) = \int_{-\infty}^{x} f(\cdot, \theta)\, d\lambda$,

• $\{F(\cdot, \theta), \theta \in \Theta\}$ is $k$-strongly identifiable,

• For all $x$, $F(x, \theta)$ is $k$-differentiable w.r.t. $\theta$,

• There is a uniform continuity modulus $\omega(\cdot)$ such that

$\sup_x |F^{(k)}(x, \theta_2) - F^{(k)}(x, \theta_1)| \le \omega(\theta_2 - \theta_1)$

with $\lim_{h \to 0} \omega(h) = 0$.

Recall notations (35) and (36). If $J$ is an end of the tree $\mathcal T$, then it satisfies
$s(J) = 0$, all the $\theta_j$ for $j \in J$ are equal, and $F(x, J) = \pi(J) F(x, \theta_J)$. In this
case, the choices $a_J(k) = \pi(J)\, 1_{\{k = 0\}}$ and $R(x, J) = 0$ work.

Assume now that Lemma 5.2 holds for any node I with parent J = in 
the tree F. We want to pass the estimates of I to the parent J. By assumption 
on /, 

2m 

(38) F{x,I) - R{x,I) = Y,aiii)4ii)F^'\^^(^i)- 

i=o 
Assuming without loss of generality that $\theta_J \le \theta_I$, we apply Taylor's formula
with remainder to $F^{(\ell)}(x, \theta_I)$ at $\theta_J$ and obtain


2m—1 


F^^\x,9i)-Y, 

k=e 


{9i - 9jf-^ 


F^^\x,9j) 


r^i{9i 


F^^^'>{x,C)dC. 


So that using Assumption B(2m), 


(n n \k—£ 


k=e 


{k-iy. 
{9i - 0 


2m—l—i r 


ie, (2m-l-£)! L 


F^^^\x,0- F^^'^\x,9j) 




= O ( sup |FP”'feO-F<^”'>(x,e^)|') 

(2m 1 tj! y^e[6»j,0j] J 

= O [{9i - 9jf^-^) , 


and by setting 


(39) 

we obtain 


^1 = 


9i-9j 

£b(j) 


and a'j(i) = ai{i) 


's(J) 


2m ak-£ 

and substituting in (38) and changing the order of summation, we get 


2m 


F{x,I) - Rix,I) = Y,el^^j)F^^\x,9j)Y,a'i{i) 




k-i 


k=0 


£=0 


(k-iy 


+ ™ax|aj(£)|o(l). 

^ ' i^2m 


Adding up over the children I of J, we obtain 

2m 

F{x,J) = j;aj(fc)e,^(^)FW(x,0j) + ii(x, J), 

k=0 




with 


(40) 

a,j{k) 

(41) 

R{x, J) 






k-l 


/GChiid(j) e=o 


/GChild(J) 




l<2m 


Proof of (a) for the node J = . From (40) for k = 0 and (39) and 

recurrence hypothesis on I, we have 

= Y1 = Y1 «-f(0)= Y1 = 


/GChild(J) 7GChild(J) /eChild(J) j&I j&J 

Moreover, since jd/l ^ 1 for each child I of J, Equation (40) yields 

\aj{k)\ 4 max \a'i{i)\. 

/eChild{J) 

Furthermore, from (39) we have |a)-(£)| ^ |a/(^)| since 63 ( 7 ) ^ t)y 

assumption on I, we have |a 7 (^)| ^ 1 so that |a 7 (£)| and thus \aj{k)\ are also 
of order one and (a) is established. 

We turn to the proof of (h) for J = . The first step is to show that 

(42) max|aj(A:)| X ma,x |a 7 (^)|. 

k<dj i<di 

/eChild(J) 

From (40), write aj{k) = {k) + pp (k) with


( 43 ) aPik) = 

/GChild(J) e=o ^ '' 

(44) af{k) = Y. E 

/GChild(J) ^ '' 

For any two distinct children I and I' of J, (39) gives 

(45) \'di -'dr\ = £fP\9i - 9r\^l, 

so that {i?/} 7 gchiid(j) is ^-separated for some e > 0. Hence, by Corollary A.2, 
if we set A = S®* 


max 

k<dj 


aPik) 


max \ai{i)\. 

i<di 

/eChiid(j) 




Now, to obtain (42), we see from (40) that it’s enough to show 

(46) Tiia^ I u j 

k<dj I ^ I 

Since |i? 7 | ^ 1, we have from (44) 

(47) 


aj\k) = o fmax|a7(£)| 


a^j\k) ^ max laUl)]. 

di^£i:k 

By assumption on I, we also have ||a/|| x max 7 <c;^ \ai{i)\, 

l+i 

\\ai 


^Bil) 


max|ar(f')| = max I a/( 
£«( J) ^<^1 


MI) 


MJ) 


so that 




MJ) 


di 


^ max 


where the last inequality comes from (39). Thus, 


(48) 


max \a'j{t)\ = o ( max|a 7 (£)| ) , 


dj^i^k 


\i<dj 


so that (47) and (48) yield (46) and (42) is proved. 
The second step is to prove 


(49) ||aj|| X max |aj(A:)|. 

k<dj 

The non-trivial part is ||aj|| ^ max^^^^j |aj(/c)|; it is equivalent to show 


max Iaj(/c)I ^ max|aj(/c)|. 

k^dj k<dj 


By the definition (40) of aj{k), (48) and (42), we have 


max |aj(/c)| 

k^dj 


4 


max > 

k'^dj 

/GChild(J) 


max |ar(£)| 

iikk 




E max ar(£) ^ 

Kdi 

/eChiid(j) 


max I a 7 (£) I ^ max I a j(/c) I. 

i<di k<dj 

7GChild(J) 


The proof of (b) is complete. 

We turn to the proof of (c) for J = M. From (49), (42) and (39), we get 


llajll 

^ max aj(A:) 

k<dj 

max |a7(£)| 

Kdj 

/GChild(J) 


max 

i<di 

7GChild(J) 

i«/(di 

/ £5(7) 

\MJ) 

(50) 




max 

7GChild(J) 

ll»dl ( 

£ 3 ( 7 ) \ 
Ml) / 
Moreover, (c) for $I \in \mathrm{Child}(J)$ gives, since $d_I \le d_J$,





max 

/'eDesc(/) 




max 

/'eDesc(/) 




V ^s{i) 

V £b{J) 


dj — 1 


dj-1 



In addition, (a) implies ||aj|| |aj(0)| = |vr(J)| x and similarly, from 

(42), (49) and (a) for I, ||aj|| ^ |a^(0)| = |a/(0)| = | 7 r(/)| x 63 ^( 7 ) so that 
(c) is established for J. 

We finally prove (d) for J = From (41), (48), assumption (d) for I and 
(42), we have 


R{x, J) ^ max 


/GChild(J) 

^ max 
/GChild(J) 


max|a^(£)|o(l)+ i?(x,/) 
4 ^2™ ||^_^|| ^(1) ^ ^ (lla/lle^-) , 


and in addition, for each child I of J, from (50), 




dj — l 2mfi-l—dj 
S(J) S(7) 




2m 




and we are done. 

The proof of (e) is already established in (45). 


APPENDIX C: FROM LOCAL TO GLOBAL: THEOREM 4.6.a
IMPLIES THEOREM 4.6.b

Set

$L = \inf_{G_1, G_2 \in \mathcal G_{\le m},\ G_1 \ne G_2} \frac{\|F(\cdot, G_1) - F(\cdot, G_2)\|_\infty}{W(G_1, G_2)^{2m-1}}$

and consider a sequence $(G_{1,n}, G_{2,n})$ in $\mathcal G_{\le m}^2$ with $G_{1,n} \ne G_{2,n}$ for each $n$ and
such that

(51)  $\frac{\|F(\cdot, G_{1,n}) - F(\cdot, G_{2,n})\|_\infty}{W(G_{1,n}, G_{2,n})^{2m-1}} \xrightarrow[n \to \infty]{} L.$

We can assume that $(G_{1,n}, G_{2,n})$ converges to some limit $(G_{1,\infty}, G_{2,\infty})$ in
the compact set $\mathcal G_{\le m}^2$. Set $w = W(G_{1,\infty}, G_{2,\infty})$, $\Delta G_n = G_{1,n} - G_{2,n}$ and
distinguish two cases: $w > 0$ and $w = 0$.

If $w > 0$, by identifiability, there is an $x_0$ such that $\delta_0 = |F(x_0, G_{1,\infty}) -
F(x_0, G_{2,\infty})| > 0$. Then, for all $n$

(52)  $\frac{\|F(\cdot, \Delta G_n)\|_\infty}{W(\Delta G_n)^{2m-1}} \ \ge\ \frac{|F(x_0, \Delta G_n)|}{W(\Delta G_n)^{2m-1}}.$

By assumption, $W(\Delta G_n)$ tends to $w$. Moreover, the numerator of the r.h.s. of
(52) tends to $\delta_0$ since the function $\theta \mapsto F(x_0, \theta)$ is $K_{x_0}$-Lipschitz with $K_{x_0} =
\max_{\theta \in \Theta} |F^{(1)}(x_0, \theta)|$. As a consequence, (52) and (51) give Theorem 4.6.b.
by choosing $\delta = \delta_0 / w^{2m-1}$.

If now $w = 0$, set $G_0 = G_{1,\infty}$ which is in $\mathcal G_{m_0}$ with some $m_0$ at most $m$.
Consider $\varepsilon > 0$ and $\delta > 0$ as defined in Theorem 4.6.a.; for $n$ large enough,
say $n \ge n_0$, $W(G_{i,n}, G_0)$, $i = 1, 2$, are less than $\varepsilon$ so that

$\inf_{n \ge n_0} \frac{\|F(\cdot, \Delta G_n)\|_\infty}{W(\Delta G_n)^{2m-2m_0+1}} > \delta.$

Moreover, for $n$ large enough, say $n \ge n_1$, $W(\Delta G_n)$ is smaller than one so
that $W(\Delta G_n)^{2m-2m_0+1}$ is more than $W(\Delta G_n)^{2m-1}$ and thus for all $n \ge
n_0 + n_1$,

$\frac{\|F(\cdot, \Delta G_n)\|_\infty}{W(\Delta G_n)^{2m-1}} \ \ge\ \inf_{n \ge n_0 + n_1} \frac{\|F(\cdot, \Delta G_n)\|_\infty}{W(\Delta G_n)^{2m-2m_0+1}} > \delta,$

which gives $L \ge \delta$ in the limit and Theorem 4.6.b. in that case.
The proof of Theorem 4.6 is complete.


APPENDIX D: (P, Q)-SMOOTHNESS

D.1. Inherited smoothness for mixing distributions. Being $(p, q)$-smooth
ensures finiteness of similar integrals when some $\theta_j$ are replaced with
mixing distributions with components close to the $\theta_j$:

Proposition D.1. Assume that the family $\{f(\cdot, \theta), \theta \in \Theta\}$ is $(p, q)$-smooth
and let $\varepsilon > 0$ be as in Definition 2.1.2. Let also $\pi_0 > 0$, $\theta_0 \in \Theta$ and
positive integers $m$, $m_0$ with $m \ge m_0$. Define mixing distributions

$G_n = \sum_{j=1}^{m} \pi_{j,n}\, \delta_{\theta_{j,n}}$

such that

• For all $j \in [\![m_0, m]\!]$, $\theta_{j,n} \to \theta_0$ as $n \to \infty$,

• For all $n$ large enough, $\sum_{j=m_0}^{m} \pi_{j,n} \ge \pi_0$.

Then for any $\theta'$ satisfying $|\theta' - \theta_0| < \varepsilon/2$, for any mixing distribution $G$:


(53) 


Eg 




^ 1 


forn large enough. If, in addition, the function x i-A has nonzero 

integral under \, then for any mixing distribution G, 


(54) 


Eg 




fi;G) 


> 1 . 
^0 


Proof. For large n and all j G |mo,w,], we have \0j,n ~ ^ for all O' 

such that \0' — 0o| < e/ 2- For all such (j, n) and all 0, by (p, ( 7 )-smoothness 
and compactness and continuity, there is a finite G such that 

li 

I fyuji. H'\ 

Ee 


/(■) 


€ G. 


Since f{x, G) is a convex combination of some f{x, 9), we may replace Eg by 
Eg in the former expression. Since the function l/y”^ is convex on positive 
reals, by Jensen inequality, setting A = YlJLmo 


E 


j=mo 


TT 




f^\x,0') 


fix, 9j,r 




&\x,9') 


Z^j=mo A JyX,t7j^ 
respec 

f(p)f^e' 




&\x,0') 


fix,G, 


and taking expectations with respect to G we obtain the upper bound: 

F 

I tunt- H') 

Eg 


fi;G„ 




A<i 


VTA 


The lower bound does not depend on {p, g)-smoothness. It is a simple 
consequence of rewriting: 


Eg 



Gf 

&\x,0o)i 

fi;G) 

J 

/(x,G)'^-i 


dA(a:). 


By assumption, there is a set B of measure A(iJ) = M > 0 on which the 
function f^\x, 9 q) is more than some J > 0. Now, for M small enough, the 
set B D {fix, G) ^ 2/M} is of measure at least M/2 and thus 


&\x,9or 


/(x,G)'?-i 


dA(x) ^ 


M' 


9+1 


5T 


□ 
D.2. (p, q)-smoothness of exponential families. Given our definition
of $(p, q)$-smoothness, it only makes sense to consider one-parameter
one-dimensional families. However, generalisation to higher dimensions should be
easy.

Let us consider an exponential family with natural parameter $\theta \in \Theta_0 \subset \mathbb R$,
so that

$f(x, \theta) = h(x)\, g(\theta) \exp(\theta T(x)),$

with $g \in C^\infty$ and a sufficient one-dimensional statistic $T(x)$. Consider $\Theta$ such
that its $\varepsilon$-neighbourhood $\Theta + B(0, \varepsilon)$ is included in $\Theta_0$. Then $\{f(\cdot, \theta), \theta \in \Theta\}$
is $(p, q)$-smooth for any $p$ and $q$. Indeed,


f^\x,6') = /i(x)e^ 




f^P\x,0' 

g(0'-e")T(D 

fix,9") 

9i0") 

f(p\x,9') 

^q{e'-e")T(x) 

fix, 9") 

gHO") 


LA:=0 
P 




k=0 


so that we get from (5) 


Ep^,{e,e',0'') = 


g<i{e'')g{e + q{e'-e'')) 


Since all the moments of the sufficient statistic $T(x)$ are finite under a
distribution in the exponential family, and since $\theta + q\theta' - q\theta''$ is in $\Theta_0$ for
$|\theta' - \theta''| < \varepsilon/q$, we obtain the finiteness of $E_{p,q}(\theta, \theta', \theta'')$. Continuity is clear.


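APPENDIX E: JACOBIAN CALCULUS

The map

$\phi : (\pi_1, \ldots, \pi_d, \theta_1, \ldots, \theta_d) \mapsto \Big( \sum_{j=1}^{d} \pi_j,\ \sum_{j=1}^{d} \pi_j \theta_j,\ \ldots,\ \sum_{j=1}^{d} \pi_j \theta_j^{2d-1} \Big)$

defined on $\mathbb R^{2d}$ has the following Jacobian:

$J(\phi) = (-1)^{\frac{(d-1)d}{2}}\, \pi_1 \cdots \pi_d \prod_{1 \le j < k \le d} (\theta_j - \theta_k)^4.$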
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To prove this, note that 

1 ... 1 0 ••• 0 

6 l ■ ■ ■ Od VTl • • • TTd 

■■■ 27ri6'i • • • 27Td9d 

g2d-i ... (2(i _ ... (2(i - l)7rrf0“-2 

so that J{4>) = ■ ■ ■ 'n'd with 

1 ... 1 0 ••• 0 

01 ■■■ Od 1 • • • 1 

A,= Gf ••• Oj 29, ••• 29d 

g2d-l ... (2(i_l)02rf-2 ... (2(i_l)02rf-2 

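Note that, if $P$ is any polynomial of degree $2d - 1$ with leading coefficient
one, the last row of $\Delta_d$ can be replaced by

$$[P(\theta_1) \cdots P(\theta_d) \quad P'(\theta_1) \cdots P'(\theta_d)],$$

and choosing $P(\theta) = (\theta - \theta_d) \prod_{j=1}^{d-1} (\theta - \theta_j)^2$, we get

$$\Delta_d = P'(\theta_d) \begin{vmatrix}
1 & \cdots & 1 & 0 & \cdots & 0 \\
\theta_1 & \cdots & \theta_d & 1 & \cdots & 1 \\
\theta_1^2 & \cdots & \theta_d^2 & 2\theta_1 & \cdots & 2\theta_{d-1} \\
\vdots & & \vdots & \vdots & & \vdots \\
\theta_1^{2d-2} & \cdots & \theta_d^{2d-2} & (2d-2)\theta_1^{2d-3} & \cdots & (2d-2)\theta_{d-1}^{2d-3}
\end{vmatrix}.$$

Again, if $Q$ is any polynomial of degree $2d - 2$ with leading coefficient one,
the last row can be replaced by

$$[Q(\theta_1) \cdots Q(\theta_d) \quad Q'(\theta_1) \cdots Q'(\theta_{d-1})],$$

and choosing $Q(\theta) = \prod_{1 \le j \le d-1} (\theta - \theta_j)^2$, we obtain the recurrence formula

$$\Delta_d = (-1)^{d-1} P'(\theta_d)\, Q(\theta_d)\, \Delta_{d-1} = (-1)^{d-1} \prod_{j=1}^{d-1} (\theta_d - \theta_j)^4\, \Delta_{d-1}.$$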
By iteration, we get

$$\Delta_d = \prod_{k=2}^{d} \left[(-1)^{k-1} \prod_{j=1}^{k-1} (\theta_k - \theta_j)^4\right] = (-1)^{\frac{d(d-1)}{2}} \prod_{1 \le j < k \le d} (\theta_j - \theta_k)^4$$

since $\Delta_1 = 1$. The proof is complete.
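
The closed form of Appendix E can be checked on small cases. The following is a minimal sketch in exact rational arithmetic, not part of the original argument; the helper names `det`, `moment_map_jacobian` and `closed_form` are hypothetical and introduced only for illustration. It builds the Jacobian matrix of the moment map row by row and compares its determinant with the product formula.

```python
from fractions import Fraction
from itertools import combinations


def det(matrix):
    """Exact determinant by Gaussian elimination over the rationals."""
    a = [[Fraction(x) for x in row] for row in matrix]
    n, sign, result = len(a), 1, Fraction(1)
    for col in range(n):
        # Find a row with a non-zero entry in this column to use as pivot.
        pivot = next((r for r in range(col, n) if a[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            sign = -sign  # a row swap flips the sign of the determinant
        result *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return sign * result


def moment_map_jacobian(pis, thetas):
    """Determinant of the Jacobian of (pi, theta) -> (sum_j pi_j theta_j^k), k = 0..2d-1."""
    d = len(pis)
    rows = []
    for k in range(2 * d):
        # Partial derivatives with respect to pi_1..pi_d, then theta_1..theta_d.
        d_pi = [Fraction(t) ** k for t in thetas]
        d_theta = [
            k * Fraction(p) * Fraction(t) ** (k - 1) if k > 0 else Fraction(0)
            for p, t in zip(pis, thetas)
        ]
        rows.append(d_pi + d_theta)
    return det(rows)


def closed_form(pis, thetas):
    """(-1)^{d(d-1)/2} * prod_j pi_j * prod_{j<k} (theta_j - theta_k)^4."""
    d = len(pis)
    value = Fraction((-1) ** (d * (d - 1) // 2))
    for p in pis:
        value *= Fraction(p)
    for tj, tk in combinations(thetas, 2):
        value *= (Fraction(tj) - Fraction(tk)) ** 4
    return value


pis, thetas = [Fraction(1, 3), Fraction(2, 3)], [Fraction(1), Fraction(3)]
print(moment_map_jacobian(pis, thetas), closed_form(pis, thetas))  # both -32/9
```

Rational arithmetic is used rather than floating point so that the comparison is an exact equality; the check is only an illustration on chosen points and does not replace the proof.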


APPENDIX F: TAYLOR EXPANSIONS AND L^1-CONVERGENCES IN THE PROOF OF THEOREM 4.3

The $Y_{i,n}(u)$'s and $Z_{i,n}$'s are centered. Recall the definition (13) of
$G_n(u)$ and set for short

$$\theta_{j,n}(u) = \theta_{m_0} + \frac{h_j(u)}{n^{1/(4d-2)}},$$

with $d = m - m_0 + 1$. By definition of the mixtures, we have

$$f(x, G_n(u)) - f(x, G_0) = \pi_{m_0} \sum_{j=m_0}^{m} \pi_{j,n}(u) \left[f(x, \theta_{j,n}(u)) - f(x, \theta_{m_0})\right],$$


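and by Taylor expansion with remainder,

$$f(x,\theta_{j,n}(u)) - f(x,\theta_{m_0}) = \sum_{k=1}^{2d-1} \left(\frac{h_j(u)}{n^{1/(4d-2)}}\right)^k \frac{f^{(k)}(x,\theta_{m_0})}{k!} + \int_{\theta_{m_0}}^{\theta_{j,n}(u)} f^{(2d)}(x,\theta)\, \frac{(\theta_{j,n}(u) - \theta)^{2d-1}}{(2d-1)!}\, d\theta,$$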


so that we get by linearity 

(») + R,,o.u) 

with 

$$R_n(x,u) = \sum_{j=m_0}^{m} \pi_{j,n}(u) \int_{\theta_{m_0}}^{\theta_{j,n}(u)} f^{(2d)}(x,\theta)\, \frac{(\theta_{j,n}(u) - \theta)^{2d-1}}{(2d-1)!}\, d\theta.$$


Since the moments $y_1, \dots, y_{2d-2}$ do not depend on $u$ whereas $y_{2d-1} = u$,
subtracting (55) with $u = 0$ from (55) yields


f{x,Gn{u)) - f{x,Gn{0)) 
TTmo 


^)i^x,9mo) + Rn{x,u) - Rn{x,0). 
Dividing by $f(x,G_n(0))$, recalling (27) and setting

$$R_{i,n}(u) = \frac{R_n(X_{i,n}, u)}{f(X_{i,n}, G_n(0))},$$

we see that (26) can be written as

$$Y_{i,n}(u) = \pi_{m_0} \left(\frac{u}{\sqrt{n}}\, Z_{i,n} + R_{i,n}(u) - R_{i,n}(0)\right).$$

Moreover, for each fixed $n$ and $u$, the i.i.d. vectors $(Y_{i,n}(u), Z_{i,n}, R_{i,n}(u))$ are
centered under $G_n(0)$. Indeed, from (26), we have

$$\mathbb{E}_{G_n(0)} Y_{1,n}(u) = \int \left[f(x,G_n(u)) - f(x,G_n(0))\right] d\lambda(x) = 0;$$

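furthermore, by expanding $f$ around $\theta_{m_0}$, dividing by $f(\cdot,G_n(0))$, taking
expectations and applying Fubini Theorem to the remainder, we get

$$0 = \sum_{k=1}^{2d-1} \frac{h^k}{k!}\, \mathbb{E}_{G_n(0)}\left[\frac{f^{(k)}(X_{i,n},\theta_{m_0})}{f(X_{i,n},G_n(0))}\right] + \int_{\theta_{m_0}}^{\theta_{m_0}+h} \frac{(\theta_{m_0} + h - \theta)^{2d-1}}{(2d-1)!}\, \mathbb{E}_{G_n(0)}\left[\frac{f^{(2d)}(X_{i,n},\theta)}{f(X_{i,n},G_n(0))}\right] d\theta;$$

Proposition D.1 ensures that each expectation exists, that Fubini Theorem is
valid and that the remainder term is of order $h^{2d}$. Thus, we deduce iteratively
that for $k \in [\![1, 2d-1]\!]$,

$$\mathbb{E}_{G_n(0)}\left[\frac{f^{(k)}(X_{i,n},\theta_{m_0})}{f(X_{i,n},G_n(0))}\right] = 0$$

and in particular $\mathbb{E}_{G_n(0)} Z_{i,n} = 0$. And dividing (55) by $f(x,G_n(0))$ gives as
a result $\mathbb{E}_{G_n(0)} R_{i,n}(u) = 0$ for all $u$.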

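$L^1$-convergences of $A_n(u)$, $B_n(u)$ and $C_n(u)$. We show the
convergences (30), (31) and (32):

$$A_n(u) = \sum_{i=1}^{n} Y_{i,n}(u) - uZ_n \xrightarrow{L^1} 0,$$

$$B_n(u) = \sum_{i=1}^{n} Y_{i,n}(u)^2 - u^2\Gamma \xrightarrow{L^1} 0,$$

$$C_n(u) = \sum_{i=1}^{n} |Y_{i,n}(u)|^3 \xrightarrow{L^1} 0.$$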
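Recall the quantities:

$$R_n(x,u) = \sum_{j=m_0}^{m} \pi_{j,n}(u) \int_{\theta_{m_0}}^{\theta_{j,n}(u)} f^{(2d)}(x,\theta)\, \frac{(\theta_{j,n}(u) - \theta)^{2d-1}}{(2d-1)!}\, d\theta, \tag{56}$$

$$Y_{i,n}(u) = \pi_{m_0} \left(\frac{u}{\sqrt{n}}\, Z_{i,n} + R_{i,n}(u) - R_{i,n}(0)\right), \tag{57}$$

$$Z_n = \pi_{m_0}\, n^{-1/2} \sum_{i=1}^{n} Z_{i,n}. \tag{58}$$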


Recall also in the following computations that for each fixed $n$ and $u$, the
i.i.d. vectors $(Y_{i,n}(u), Z_{i,n}, R_{i,n}(u))$ are centered under $G_n(0)$.

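Proof of (30). Note that from (57) and (28)

$$A_n(u) = \pi_{m_0} \left(\sum_{i=1}^{n} R_{i,n}(u) - \sum_{i=1}^{n} R_{i,n}(0)\right)$$

and the equalities

$$\mathbb{E}_{G_n(0)} \left|\sum_{i=1}^{n} R_{i,n}(u)\right|^2 = \sum_{i=1}^{n} \mathbb{E}_{G_n(0)} |R_{i,n}(u)|^2 = n\, \mathbb{E}_{G_n(0)} |R_{1,n}(u)|^2$$

will give the desired $L^1$-convergence if we can prove that for each $u$,

$$\mathbb{E}_{G_n(0)} |R_{1,n}(u)|^2 = o\left(\frac{1}{n}\right). \tag{59}$$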

To this end, we look at the expression (56) of $R_n(x,u)$ for fixed $u$. We have
$|\theta_{j,n}(u) - \theta| \le H(u)\, n^{-1/(4d-2)}$ for any $\theta$ in the integrand, any $j$ and $n$.

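We may thus write

$$|R_n(x,u)| \le \sum_{j=m_0}^{m} \pi_{j,n}(u) \int_{\theta_{m_0} - H(u) n^{-1/(4d-2)}}^{\theta_{m_0} + H(u) n^{-1/(4d-2)}} |f^{(2d)}(x,\theta)|\, \frac{|\theta_{j,n}(u) - \theta|^{2d-1}}{(2d-1)!}\, d\theta \lesssim n^{-1/2} \int_{\theta_{m_0} - H(u) n^{-1/(4d-2)}}^{\theta_{m_0} + H(u) n^{-1/(4d-2)}} |f^{(2d)}(x,\theta)|\, d\theta.$$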


Since we have $\sigma$-finite measures, we may use Fubini theorem. Since moreover
$\theta$ in the integrand is between $\theta_{m_0}$ and $\theta_{j,n}(u)$ which converges to $\theta_{m_0}$, we
may then apply Proposition D.1. For $q \in [\![1,4]\!]$, using convexity of $x \mapsto x^q$
on line two, we may then write:
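$$\begin{aligned}
\mathbb{E}_{G_n(0)} |R_{1,n}(u)|^q &\lesssim_u n^{-\frac{q}{2}}\, \mathbb{E}_{G_n(0)} \left|\int_{|\theta - \theta_{m_0}| \le H(u) n^{-1/(4d-2)}} \left|\frac{f^{(2d)}(\cdot,\theta)}{f(\cdot,G_n(0))}\right| d\theta\right|^q \\
&\lesssim_u n^{-\frac{q}{2} - \frac{q-1}{4d-2}} \int_{|\theta - \theta_{m_0}| \le H(u) n^{-1/(4d-2)}} \mathbb{E}_{G_n(0)} \left|\frac{f^{(2d)}(\cdot,\theta)}{f(\cdot,G_n(0))}\right|^q d\theta \\
&\lesssim_u n^{-\frac{q}{2} - \frac{q}{4d-2}}.
\end{aligned} \tag{60}$$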


Take $q = 2$ to obtain (59); the proof of (30) is complete.

Proof of (31). Write


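$$B_n(u) = B_n^1(u) + B_n^2(u),$$

with

$$B_n^1(u) = \sum_{i=1}^{n} Y_{i,n}(u)^2 - \frac{u^2 \pi_{m_0}^2}{n} \sum_{i=1}^{n} Z_{i,n}^2, \qquad B_n^2(u) = \frac{u^2 \pi_{m_0}^2}{n} \sum_{i=1}^{n} Z_{i,n}^2 - u^2\Gamma.$$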


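Note first that from (57) and (58),

$$B_n^1(u) = \pi_{m_0}^2 \sum_{i=1}^{n} \left(R_{i,n}(u) - R_{i,n}(0)\right)^2 + \frac{2u\pi_{m_0}^2}{\sqrt{n}} \sum_{i=1}^{n} Z_{i,n} \left(R_{i,n}(u) - R_{i,n}(0)\right),$$

so that taking the $L^1$-norm and by the Cauchy-Schwarz inequality,

$$\mathbb{E}_{G_n(0)} |B_n^1(u)| \lesssim_u n\, \mathbb{E}_{G_n(0)} \left[|R_{1,n}(u)|^2 + |R_{1,n}(0)|^2\right] + \sqrt{n\, \mathbb{E}_{G_n(0)} \left[|R_{1,n}(u)|^2 + |R_{1,n}(0)|^2\right]}\, \sqrt{\mathbb{E}_{G_n(0)} Z_{1,n}^2}$$

and the r.h.s. tends to 0 by (59) and the fact that $\mathbb{E}_{G_n(0)} Z_{1,n}^2 \lesssim 1$. Besides,
setting $\delta_n = |\pi_{m_0}^2\, \mathbb{E}_{G_n(0)} Z_{1,n}^2 - \Gamma|$, we have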


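$$\mathbb{E}_{G_n(0)} |B_n^2(u)| \le u^2 \pi_{m_0}^2\, \mathbb{E}_{G_n(0)} \left|\frac{1}{n} \sum_{i=1}^{n} \left(Z_{i,n}^2 - \mathbb{E}_{G_n(0)} Z_{1,n}^2\right)\right| + u^2 \delta_n \lesssim_u n^{-1/2} \sqrt{\operatorname{Var}_{G_n(0)}(Z_{1,n}^2)} + \delta_n,$$

which goes to zero since $\delta_n \to 0$ by definition and $\mathbb{E}_{G_n(0)} Z_{1,n}^4 \lesssim 1$ by
Proposition D.1. We have thus

$$\mathbb{E}_{G_n(0)} |B_n(u)| \to 0,$$

which proves (31).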

Proof of (32). It is easily seen from (57) that

$$C_n(u) \lesssim_u n^{-3/2} \sum_{i=1}^{n} |Z_{i,n}|^3 + \sum_{i=1}^{n} |R_{i,n}(u)|^3 + \sum_{i=1}^{n} |R_{i,n}(0)|^3$$

so that taking expectations

$$\mathbb{E}_{G_n(0)} |C_n(u)| \lesssim_u n^{-1/2}\, \mathbb{E}_{G_n(0)} |Z_{1,n}|^3 + n\, \mathbb{E}_{G_n(0)} \left[|R_{1,n}(u)|^3 + |R_{1,n}(0)|^3\right].$$

But each of the three terms in the r.h.s. tends to 0: the first one because
$\mathbb{E}_{G_n(0)} |Z_{1,n}|^3 \lesssim 1$ by Proposition D.1, the second and the third ones because
of (60) for $q = 3$. Thus $C_n(u)$ converges to 0 in $L^1$.


APPENDIX G: PROOF OF THEOREM 4.8 

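Assume without loss of generality that $G_0 = \sum_{i=1}^{m} \pi_{i,0}\, \delta_{\theta_{i,0}}$ with $\pi_{i,0} \ge \pi_{\min,0} > 0$
and $\theta_{i+1,0} - \theta_{i,0} \ge \kappa_0$ for all $i$, with $\kappa_0 > 0$.

Then, with $\varepsilon > 0$ small enough, any mixture $G$ in $\mathcal{G}_{\le m} \cap \mathcal{W}_{G_0}(\varepsilon)$ must
have exactly one component close to each $\theta_{i,0}$, with a weight of order one.
More precisely, for $\varepsilon = \pi_{\min,0}\kappa_0/4$,

$$G \in \mathcal{G}_{\le m} \cap \mathcal{W}_{G_0}(\varepsilon) \implies G = \sum_{i=1}^{m} \pi_i\, \delta_{\theta_i} \ \text{ with } \pi_i \ge \frac{\pi_{\min,0}}{2} \text{ and } |\theta_i - \theta_{i,0}| < \frac{\kappa_0}{2}. \tag{61}$$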

Indeed, by the very definition of $W$ (not the dual form), there is a probability
measure $\pi(\cdot,\cdot)$ on $\Theta \times \Theta$ with marginals $G_0 = \pi(\cdot,\Theta)$ and $G = \pi(\Theta,\cdot)$ such
that

$$W(G, G_0) = \sum_{i,j=1}^{m} |\theta_{i,0} - \theta_j|\, \pi(\{\theta_{i,0}\}, \{\theta_j\}). \tag{62}$$


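Set $J_{i,0} = \{j : |\theta_{i,0} - \theta_j| < \kappa_0/2\}$ for each $i \in [\![1,m]\!]$. Then, from (62), for
each $i$,

$$W(G,G_0) \ge \sum_{j \notin J_{i,0}} |\theta_{i,0} - \theta_j|\, \pi(\{\theta_{i,0}\},\{\theta_j\}) \ge \frac{\kappa_0}{2} \sum_{j \notin J_{i,0}} \pi(\{\theta_{i,0}\},\{\theta_j\}) \ge \frac{\kappa_0}{2} \left(\pi_{\min,0} - \sum_{j \in J_{i,0}} \pi(\{\theta_{i,0}\},\{\theta_j\})\right).$$

Thus, if $W(G, G_0) < \pi_{\min,0}\kappa_0/4$, then we must have for each $i$,

$$\sum_{j \in J_{i,0}} \pi(\{\theta_{i,0}\},\{\theta_j\}) > \frac{\pi_{\min,0}}{2} \tag{63}$$

and each $J_{i,0}$ is non-empty. Furthermore, the (disjoint) $J_{i,0}$'s, $i \in [\![1,m]\!]$,
are all singletons; otherwise there would be at least one $J_{i,0}$ empty, since
$G_0$ has exactly $m$ support points and $G$ at most $m$. Considering a suitable
numbering for the components of $G$, we can thus write $J_{i,0} = \{i\}$ so that
$|\theta_{i,0} - \theta_i| < \kappa_0/2$ for each $i$ and (63) yields $\pi_i \ge \pi(\{\theta_{i,0}\},\{\theta_i\}) > \pi_{\min,0}/2$.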

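Now, set

$$L = \inf_{G_1 \ne G_2} \frac{\|F(\cdot,G_1) - F(\cdot,G_2)\|_\infty}{W(G_1,G_2)}.$$

Select sequences of mixing distributions $G_{1,n} \ne G_{2,n}$ in $\mathcal{G}_{\le m} \cap \mathcal{W}_{G_0}(\varepsilon)$ such
that:

$$\frac{\|F(\cdot,G_{1,n}) - F(\cdot,G_{2,n})\|_\infty}{W(G_{1,n},G_{2,n})} \xrightarrow[n \to \infty]{} L.$$

We have to prove that $L > 0$. Actually we shall prove that for $n$ large enough,

$$\frac{\|F(\cdot,G_{1,n}) - F(\cdot,G_{2,n})\|_\infty}{W(G_{1,n},G_{2,n})} \gtrsim 1. \tag{64}$$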


Up to taking subsequences, we can write $G_{a,n} = \sum_{j=1}^{m} \pi_{j,a,n}\, \delta_{\theta_{j,a,n}}$ with the
convergences $\pi_{j,a,n} \to \pi_{j,a,\infty}$ and $\theta_{j,a,n} \to \theta_{j,a,\infty}$, for $a \in \{1,2\}$. Note that
$G_{a,\infty} = \sum_{j=1}^{m} \pi_{j,a,\infty}\, \delta_{\theta_{j,a,\infty}}$ is in $\mathcal{G}_{\le m} \cap \mathcal{W}_{G_0}(\varepsilon)$ and thus satisfies (61). In
particular, the $\theta_{j,a,\infty}$'s are $\varepsilon_a$-separated for some $\varepsilon_a > 0$; this will be used
for $a = 2$ below.

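Note first that for any 1-Lipschitz function $f$ and any mixing distributions
$G, G' \in \mathcal{G}_{\le m}$,

$$\left|\int f\, d(G - G')\right| \le \sum_{j=1}^{m} |\theta_j - \theta'_j| + |\pi_j - \pi'_j|\, \mathrm{Diam}(\Theta)$$

so that

$$W(G_{1,n}, G_{2,n}) \lesssim \sum_{j=1}^{m} |\theta_{j,1,n} - \theta_{j,2,n}| + |\pi_{j,1,n} - \pi_{j,2,n}|.$$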
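
The coordinate-wise bound on $W$ can be illustrated numerically on the real line. The sketch below is not part of the original proof: it assumes $\Theta$ is an interval containing all atoms, pairs the components of the two mixtures by index, and uses the one-dimensional identity expressing $W$ as the integral of the absolute difference of the two distribution functions. The function names `wasserstein_1d` and `coordinate_bound` are hypothetical.

```python
def wasserstein_1d(atoms_g, weights_g, atoms_h, weights_h):
    """W1 between two finitely supported measures on the real line,
    computed as the integral of |CDF_G - CDF_H|."""
    # Signed jumps of CDF_G - CDF_H, sorted by location.
    events = sorted(
        [(t, w) for t, w in zip(atoms_g, weights_g)]
        + [(t, -w) for t, w in zip(atoms_h, weights_h)]
    )
    total, cdf_gap = 0.0, 0.0
    for (t, w), (t_next, _) in zip(events, events[1:]):
        cdf_gap += w  # value of CDF_G - CDF_H on [t, t_next)
        total += abs(cdf_gap) * (t_next - t)
    return total


def coordinate_bound(atoms_g, weights_g, atoms_h, weights_h):
    """sum_j |theta_j - theta'_j| + |pi_j - pi'_j| * Diam, components paired by index.
    Diam is taken as the diameter of the convex hull of all atoms (an assumption)."""
    all_atoms = list(atoms_g) + list(atoms_h)
    diam = max(all_atoms) - min(all_atoms)
    return sum(
        abs(t - s) + abs(p - q) * diam
        for t, p, s, q in zip(atoms_g, weights_g, atoms_h, weights_h)
    )


g_atoms, g_weights = [0.0, 1.0], [0.5, 0.5]
h_atoms, h_weights = [0.1, 0.8], [0.4, 0.6]
w = wasserstein_1d(g_atoms, g_weights, h_atoms, h_weights)
b = coordinate_bound(g_atoms, g_weights, h_atoms, h_weights)
print(w <= b)  # True for this example
```

On this example the transport distance is about 0.22 while the coordinate bound is about 0.5, so the inequality holds with room to spare; the bound is meant to be simple, not tight.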

To obtain (64), it remains to prove that for large $n$,

$$\|F(\cdot,G_{1,n}) - F(\cdot,G_{2,n})\|_\infty \gtrsim \sum_{j=1}^{m} |\theta_{j,1,n} - \theta_{j,2,n}| + |\pi_{j,1,n} - \pi_{j,2,n}|. \tag{65}$$

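By Taylor expansion of $F(x,\theta_{j,1,n})$ around $\theta_{j,2,n}$ and Assumption B(1),

$$F(x, G_{1,n}) - F(x, G_{2,n}) = S_n(x) + O\left(\sum_{j=1}^{m} |\theta_{j,1,n} - \theta_{j,2,n}|^2\right) \tag{66}$$

with

$$S_n(x) = \sum_{j=1}^{m} (\pi_{j,1,n} - \pi_{j,2,n})\, F(x, \theta_{j,2,n}) + \pi_{j,1,n}\, (\theta_{j,1,n} - \theta_{j,2,n})\, F'(x, \theta_{j,2,n}).$$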

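In addition, by convergence of $\theta_{j,a,n}$ to $\theta_{j,a,\infty}$ for each $j$ and $a = 1,2$, there
is an integer $n_0$ such that for all $n \ge n_0$, each $(\theta_{j,2,n})_{1 \le j \le m}$ is $(\varepsilon_2/2)$-separated
and $\pi_{j,1,n} \ge \pi_{j,1,\infty}/2$ for each $j$. So that by Proposition 2.3, for all $n \ge n_0$,

$$\|S_n(\cdot)\|_\infty \gtrsim \sum_{j=1}^{m} |\pi_{j,1,n} - \pi_{j,2,n}| + \frac{\pi_{j,1,\infty}}{2}\, |\theta_{j,1,n} - \theta_{j,2,n}|.$$

Since (61) holds for $G_{1,\infty} = \sum_{j=1}^{m} \pi_{j,1,\infty}\, \delta_{\theta_{j,1,\infty}}$, we have $\pi_{j,1,\infty} \ge \pi_{\min,0}/2$
for each $j$. Thus, for all $n \ge n_0$,

$$\|S_n(\cdot)\|_\infty \gtrsim \sum_{j=1}^{m} |\pi_{j,1,n} - \pi_{j,2,n}| + |\theta_{j,1,n} - \theta_{j,2,n}|.$$

Combining this last inequality with the sup-norm of (66) gives (65). This
ends the proof.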